In-Silico Approaches for the Screening and Discovery of Broad-Spectrum Marine Natural Product Antiviral Agents Against Coronaviruses
Abstract
The urgent need for SARS-CoV-2 controls has led to a reassessment of approaches to identify and develop natural product inhibitors of zoonotic, highly virulent, and rapidly emerging viruses. There are as yet no clinically approved broad-spectrum antivirals available for betacoronaviruses. Discovery pipelines for pan-virus medications against a broad range of betacoronaviruses are therefore a priority. A variety of marine natural product (MNP) small molecules have shown inhibitory activity against viral species. Access to large data caches of small molecule structural information is vital to finding new pharmaceuticals. Increasingly, molecular docking simulations are being used to narrow the space of possibilities and generate drug leads. Combining in-silico methods, augmented by metaheuristic optimization and machine learning (ML), allows the generation of hits from within a virtual MNP library to narrow screens for novel targets against coronaviruses. In this review article, we explore current insights and techniques that can be leveraged to generate broad-spectrum antivirals against betacoronaviruses using in-silico optimization and ML. ML approaches are capable of simultaneously evaluating different features for predicting inhibitory activity. Many also provide a semi-quantitative measure of feature relevance and can guide the selection of a subset of features relevant for inhibition of SARS-CoV-2.
Plain Language Summary
Coronaviruses (CoVs) are a family of viruses that cause lung and intestinal illnesses in humans and animals. Generally, these are mild illnesses characterized by cold-like symptoms. However, viruses that cause severe disease have emerged over the past twenty years, first with the severe acute respiratory syndrome (SARS) epidemic in China in 2002–2003 and subsequently the Middle East respiratory syndrome (MERS) on the Arabian Peninsula in 2012. The novel coronavirus that emerged in Wuhan at the tail end of 2019 has to date killed >6.01 million people (JHU-CSSE, 2022) and collectively cost the world’s economy >16 trillion dollars. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the causative virus responsible for this coronavirus disease (COVID-19). The urgent need to control this family of viruses has led to a reassessment of approaches to identify and develop new antiviral drugs. One area of investigation has been marine natural products, which are compounds or substances produced by living organisms present in the marine environment. Access to collections of these natural products is vital to finding new pharmaceuticals to combat coronavirus diseases. Increasingly, advances in artificial intelligence are aiding this drug discovery process.
Introduction
The novel coronavirus that emerged in Wuhan at the tail end of 2019 has to date killed >6.01 million people (JHU-CSSE, 2022) and collectively cost the world’s economy >16 trillion dollars.1 Despite the massive research response to the pandemic,2 there are as yet no clinically approved broad-spectrum antivirals available for betacoronaviruses. The difficulty with relying solely on immunological agents for protection against coronaviruses is the high mutation rate and loss of epitope specificity between family members and variants of the same species of coronavirus.3 This means that, in the event of a future outbreak of coronavirus, a bespoke vaccine must be developed and pass regulatory testing before it can be used to prevent the spread of the disease and treat those infected.4
Discovery pipelines for pan-virus medications against a broad range of betacoronaviruses are therefore a priority for preventing high mortality rates in future outbreaks.5 Natural metabolites of plants, fungi and bacteria have long been known to have antiviral activity.6 These organisms lack the complex adaptive immune system of animals and rely on the production of broad-spectrum, small molecule inhibitors to keep pathogenic viruses at bay.7 Currently, however, drug discovery in relation to coronaviruses focuses on the repurposing of already characterized pharmaceuticals or on using existing pharmaceuticals as structural leads.8–11 Existing drug structural information is, however, a much smaller sample space than the population of known molecular structures, with only around 12,000 characterized members on the online database DrugBank (as of March 2022), accounting for only about 0.5% of the structures available from ChEMBL. Conversely, there are currently around 400,000 natural products, which are curated in online databases and are accessible for molecular docking simulations.12 A variety of NP small molecules isolated from photosynthetic algae (eg, phlorotannins, sulfated polysaccharides), marine bacteria (eg, lactones), and sponges (eg, nucleosides, sesquiterpene hydroquinones, cyclic depsipeptides, alkaloids, etc.) have shown inhibitory activity against viral species including human immunodeficiency virus-1 (HIV-1), HCV, influenza, and herpes simplex virus.13 Examination of marine natural products from algae that are active against SARS-CoV-2 proteins presents a novel and viable approach.
Naturally occurring bioactive compounds represent a viable approach for the development of antiviral agents. Flavonoids, for example, exhibit broad antiviral and immunomodulatory activities against coronaviruses. Flavonoids are key secondary plant metabolites that have been the subject of much study for their therapeutic potential in inflammatory diseases owing to their cytokine-modulatory effects. The antiviral activity of flavonoids is realized via enzymatic inhibition of the 3C-like protease (3CLpro), the primary protease found in coronaviruses. Recently, five compounds obtained from Camellia reticulata and Anastatica hierochuntica (plants) and Kermia aegyptiaca (a marine gastropod mollusk), namely taxifolin, pectolinarigenin, tangeretin, gardenin B, and hispidulin, were examined for activity against SARS-CoV-2 and represent promising candidates for COVID-19 management.14 In a separate study, thirty-three focused marine NPs related to the pederin, mycalamide, onnamide and theopederin polyketide families were assessed using computational approaches, including molecular docking and molecular dynamics simulation studies, for their affinity for the dimeric form of 3CLpro. This revealed that the majority of the marine NPs examined had favorable binding scores, in particular dihydro-onnamide A, onnamide C, and pseudo-onnamide A.15
Access to large data caches of small molecule structural information is vital to finding new pharmaceuticals. This is because, in general, in-vitro-only de novo drug discovery, reliant on vast chemical libraries and expensive and extensive robotics, is a laborious and low yielding strategy.16,17 Increasingly, in the age of big data and deep learning, molecular docking simulations are being used to narrow the space of possibilities and generate drug leads.11,18,19 It would therefore be worthwhile to combine in-silico methods, augmented by metaheuristic optimization and machine learning (ML), to generate hits from within a virtual library of natural products in order to narrow down high throughput in vitro screens for new drug targets against coronaviruses. In this article, we will explore current insights and techniques that can be leveraged to generate broad-spectrum antivirals against betacoronaviruses using in-silico optimization and ML.
Drug Targets
Members of the genus Betacoronavirus are enveloped (surrounded by a bilipid membrane), positive sense RNA viruses (which means the genome can be used directly in translation) that share a unique lifecycle and 50–80% sequence homology (between SARS-CoV, SARS-CoV-2 and MERS-CoV).20 Of the 29 open reading frames in the small genome, four encode structural proteins: S (Spike), important for host recognition and attachment, M (membrane) and E (envelope), which mediate bilipid fusion in entry and release of the virus, and N (nucleocapsid), which forms the protective protein shell around the viral genome and mediates assembly of the final virion after replication.21 Structural proteins are essential to the entry and assembly of viruses and can therefore be targeted by antiviral therapies (Figure 1). Despite the important role of these proteins in the viral life cycle and their utility as targets for extracellular drug therapy, M, E and N proteins are nonetheless unattractive targets for the development of a broad-spectrum antiviral due to the high variability in protein sequence across the genus.21,22 On the other hand, the ACE2 binding domain of the S protein, as determined by a team at Tsinghua University, Beijing, shows high structural conservation across the family of coronaviruses most related to SARS-CoV and SARS-CoV-2, making it a promising target for inhibitory binding.23
Because structural proteins are most exposed to host immune recognition and are therefore subject to stronger selection pressure, intracellular viral proteins are far more conserved.22 The most important non-structural proteins (NSP) to the betacoronavirus life cycle are an RNA-dependent RNA polymerase (RdRp) and two selective proteases, a serine-type protease (MPRO) and a papain-like protease (PLPRO).22 RdRp is essential for the reproduction of all non-retroviral RNA viruses that infect animals (animals do not express endogenous RdRp) and many that infect plants, and in positive sense RNA viruses it is used to first create a template negative strand and then replicate the genome, aided by an RNA helicase.24 The RdRp protein is highly conserved in betacoronaviruses, with similarity of 96% between SARS-CoV and SARS-CoV-2, and 70% between MERS-CoV and SARS-CoV/SARS-CoV-2.25 The genomes of most plant viruses are single positive strand RNA, which means there are potentially many phytochemical inhibitors of RdRp available from natural sources.8,26
Proteases MPRO and PLPRO are essential for post-translational processing in the virus’s replicative cycle.22 Figure 2 shows that, despite a low percentage identity between PLPRO protein sequences in SARS and MERS coronaviruses (28–30%), structure-based multiple protein alignment revealed a conserved homology in the proteolytic active site (core root mean square deviation (RMSD) of 1.47). A further search of this conserved sequence on the Conserved Domain Database showed that the catalytic domain preserves the cysteine protease catalytic triad (Cys, His, Asp) even in the most diverse members of the group.27 Many plant viruses use a papain-like cysteine protease for reproductive cleavage,28 which indicates there may be phytochemical inhibitors of PLPRO.29 The MPRO protease is similarly a popular target of competitive inhibition due to its conserved serine-like proteolytic domain.27
The 2′-O-MTase protein is another promising betacoronavirus target, which binds to S-adenosylmethionine (SAM) to facilitate 2′-O-ribose methylation. It exists as a heterodimer made of two non-structural protein units, ie, NSP16 and NSP10. NSP10 assists in the stabilization of NSP16 and the SAM binding sites.22,30 A previous study has shown that the NSP16 and NSP10 protein sequences are highly conserved in SARS-CoV and SARS-CoV-2, with similarity values of 99.7% and 99.3%, respectively. In the case of MERS-CoV, the similarity with SARS-CoV/SARS-CoV-2 is lower, at 53.9% and 63.8% for NSP16 and NSP10, respectively. Despite the low similarity between SARS-CoV/SARS-CoV-2 and MERS-CoV, the 2′-O-MTase protein has a highly conserved SAM binding pocket with similar interacting residues across betacoronaviruses.31
The highly conserved binding pockets of the betacoronavirus protein targets (such as in MPRO, PLPRO, 2′-O-MTase, and RdRp; Figure 3) open up the potential for discovering a compound that can bind effectively to a similar target protein of different species or variants from this genus to inhibit the protein’s activity. This strategy paves the way for the development of a broad-spectrum antiviral agent as a treatment for coronavirus infection.
Three strategies currently exist for developing new drugs targeted against SARS-CoV-2. The first is focused on existing broad-spectrum antivirals including interferons, ribavirin, and cyclophilin inhibitors, which are all employed to treat pneumonia caused by coronavirus. The limitation of these is that they are too “broad-spectrum” and cannot neutralize coronaviruses in a targeted fashion. The second approach exploits existing molecular databases to screen for molecules that may have therapeutic effects on coronavirus. The third strategy leverages genomic information and the pathological characteristics of different coronaviruses to develop novel targeted drugs. Theoretically, these therapeutics would exhibit better anti-coronavirus effects; however, the research and development required could take more than a decade.32
Molecular Docking Algorithms
Molecular docking is a widely used technique for obtaining a preliminary assessment of how well a ligand interacts with a drug target.33,34 The general approach of such algorithms is to use a predetermined structure of the macromolecular target (generally formatted as a PDB file) and 1) subdivide the entire structure or predetermined binding pockets into a search grid, 2) randomly place the ligand structure inside the grid, 3) assign the binding energy of the ligand-target pose based on a scoring function, 4) repeat steps 2 and 3 exhaustively or, using heuristic optimization, until a threshold is achieved and 5) report the score achieved. Because this is a non-parametric search, several approaches can be used to shrink the search space; otherwise it would become too computationally expensive for a library search.33 The number of poses that must be calculated is proportional to the size of the assigned grid, and this dimension can usually be reduced by knowing predetermined binding sites for substrates (for example, in order to inhibit the S protein an inhibitor must bind to the ACE2 binding site).23 If important sites are not known or if allosteric binding is desired, “hot spots” can be determined before the search by using cavity search algorithms (such as the one used by MolDock)34 or by using small molecular fragment probes to search the entire protein surface (small molecules have fewer poses and have, theoretically, better binding energies than more complex ligands).35 BetaCoV PLPRO has 3 conserved sites: a peptide binding site/catalytic domain, a zinc binding site and a ubiquitin-binding domain.27 All three appear to be important for the function of this protease and the life cycle of the virus.29 Establishing the hotspots of these domains and the druggability of each binding site might be beneficial for guiding a library search. Figure 4 shows three binding hotspots found by Fragment Hotspot Maps,35 showing overlap with all three conserved domains.
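The five steps above can be illustrated with a minimal sketch. It is not how any particular docking program works: the scoring function `toy_score`, the assumed binding site coordinates, the grid bounds and the threshold are all hypothetical, and a pose is reduced to a ligand centre position (rotations and torsions are omitted).

```python
import math
import random

def toy_score(pose, site=(1.0, -2.0, 0.5)):
    """Hypothetical scoring function: lower is better, like a binding energy.
    It is simply the squared distance from an assumed binding site."""
    return sum((p - s) ** 2 for p, s in zip(pose, site))

def random_search_dock(score, grid_min=-5.0, grid_max=5.0,
                       n_samples=5000, threshold=0.05, seed=0):
    """Steps 1-5: sample poses inside a cubic grid, score each one,
    stop early if the threshold is reached, and report the best pose."""
    rng = random.Random(seed)
    best_pose, best_score = None, math.inf
    for _ in range(n_samples):
        # Step 2: place the ligand at a random position inside the grid.
        pose = tuple(rng.uniform(grid_min, grid_max) for _ in range(3))
        # Step 3: score the ligand-target pose.
        current = score(pose)
        if current < best_score:
            best_pose, best_score = pose, current
        # Step 4: stop once the score threshold is achieved.
        if best_score <= threshold:
            break
    # Step 5: report the result.
    return best_pose, best_score

pose, energy = random_search_dock(toy_score)
print(pose, energy)
```

Restricting `grid_min` and `grid_max` to a known binding pocket is the simplest way to shrink the number of poses that must be sampled.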
Although systematic, exhaustive search strategies are sometimes employed by molecular dynamics (MD) algorithms, heuristic methods are preferred for ligands with more degrees of freedom.33 Different meta-heuristics are used to explore the search space efficiently, including a Monte Carlo approach (SMINA and ISO software), genetic algorithms (DockThor, GAsDock and GOLD) and particle swarm optimization (PLANTS). In general, Monte Carlo approaches randomly sample poses, identify clusters of high scoring poses and iteratively search inside the clusters for optima.36 This algorithm works very well for finding global optima while avoiding “trapping” in local optima; however, it requires a large amount of sampling, scaling exponentially with the degrees of freedom of the ligand. Genetic algorithms (the most common approach) improve on Monte Carlo based approaches by reducing the amount of stochastic sampling. These algorithms 1) select the highest scoring poses in a random sampling, 2) “mutate” the samples by randomly altering their pose, 3) shuffle the bond rotations and average the position between two “mating” pairs (also called recombination), 4) retest, select the highest scoring poses and repeat 2–4 until a local maximum is found or a pre-set number of generations have elapsed.37
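The genetic algorithm steps can be sketched as follows. This is an illustrative toy, not the implementation used by DockThor, GAsDock or GOLD: a pose is a short list of hypothetical torsion angles, `toy_score` is an assumed fitness function with a single optimum, and recombination is reduced to averaging two parents.

```python
import random

def toy_score(pose):
    """Hypothetical fitness: higher is better, peaking at an assumed optimum."""
    target = (0.3, -1.2, 2.0, 0.7)
    return -sum((p - t) ** 2 for p, t in zip(pose, target))

def genetic_search(score, n_dims=4, pop_size=40, n_keep=10,
                   n_generations=60, mutation_sd=0.2, seed=1):
    rng = random.Random(seed)
    # Initial random sampling of poses.
    population = [[rng.uniform(-3.14, 3.14) for _ in range(n_dims)]
                  for _ in range(pop_size)]
    for _ in range(n_generations):
        # 1) Select the highest scoring poses.
        population.sort(key=score, reverse=True)
        parents = population[:n_keep]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # 3) Recombination: average two "mating" parents.
            child = [(x + y) / 2 for x, y in zip(a, b)]
            # 2) Mutation: randomly perturb the pose.
            child = [x + rng.gauss(0, mutation_sd) for x in child]
            children.append(child)
        # 4) Retest on the next pass; parents are kept, so the best
        #    pose found so far is never lost.
        population = parents + children
    return max(population, key=score)

best = genetic_search(toy_score)
print(best, toy_score(best))
```

Keeping the parents in each generation is a design choice (elitism) made here for simplicity; it guarantees the best score never decreases between generations.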
Database Searching
There are around 400,000 structures in the COlleCtion of Open Natural ProdUcTs (COCONUT),12 which means that an iterative search for a new inhibitor of a target protein, if each search took 2 hours, would take roughly 91 years to complete (assuming, of course, that searches were run consecutively). Even with concurrent searches, iterative strategies are computationally expensive and would require a reduction of accuracy by using simpler approximations.33
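As a quick check of the arithmetic for a strictly serial screen (assuming one docking run at a time and a 365.25-day year), about 2 hours per structure gives roughly 91 years for 400,000 structures, whereas 2 minutes per structure would give about 1.5 years:

```python
N_STRUCTURES = 400_000                   # approximate size of COCONUT
MINUTES_PER_YEAR = 365.25 * 24 * 60      # assumes a 365.25-day year

def serial_screen_years(n_structures, minutes_per_search):
    """Wall-clock years needed if searches are run one after another."""
    return n_structures * minutes_per_search / MINUTES_PER_YEAR

print(round(serial_screen_years(N_STRUCTURES, 2), 2))    # 2 minutes each -> 1.52
print(round(serial_screen_years(N_STRUCTURES, 120), 1))  # 2 hours each   -> 91.3
```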
A popular non-systematic approach is to use a fragment library, made from a library of structures, to iteratively build an “ideal inhibitor”. This approach can use a genetic algorithm or another meta-heuristic, such as simulated annealing, to build a stochastically optimized molecule.38,39 Each candidate can then be used to search the library of natural products for “real candidates” based on 3D similarity. Supplementary Table S1 provides a summary of the metrics that can be used to compare an ideal molecule to a library of real molecules. Based on these metrics, the top candidates from the library can then be tested by molecular docking in a small number of similar poses. This process can be repeated until a list of high scoring molecules is available.39
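The step of matching an “ideal” molecule to real library members can be sketched with a similarity ranking. The example below uses the Tanimoto coefficient on hypothetical fingerprint bit sets as a stand-in; the metrics actually listed in Supplementary Table S1 are 3D similarity measures and are not reproduced here, and all molecule names and fingerprints are invented for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of 'on' bits."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_library(ideal_fp, library, top_n=3):
    """Return the top_n library members most similar to the ideal molecule."""
    scored = [(tanimoto(ideal_fp, fp), name) for name, fp in library.items()]
    scored.sort(reverse=True)
    return [(name, round(score, 3)) for score, name in scored[:top_n]]

# Hypothetical fingerprints: each integer is the index of a set bit.
ideal = {1, 4, 7, 9, 12}
library = {
    "mol_A": {1, 4, 7, 9, 12},
    "mol_B": {1, 4, 7, 20},
    "mol_C": {2, 3, 5},
    "mol_D": {4, 9, 12, 15, 18, 21},
}
print(rank_library(ideal, library))
# [('mol_A', 1.0), ('mol_B', 0.5), ('mol_D', 0.375)]
```

Only the top-ranked members would then be passed on to the more expensive docking step.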
Alternatively, a meta-heuristic approach can be used to search an organized library. The similarity scores based on metrics from Supplementary Table S1 and information from Table 1 can be used to arrange the library of molecules as nodes in n-dimensional space, with edge weights proportional to three-dimensional similarity (Supplementary Table S1) and similarity of attributes (Table 2; Figure 5A).40
Table 1
Molecular Attributes | Description | Parameters and Assumptions |
---|---|---|
Quantitative Structure Activity Relationship, QSAR96 | Based on Lasso Regression of attributes of characterized drugs versus their quantified activity. | Attribute list can be generated based on observation or generated by machine learning. |
Differential solubility97 | logP = log base 10 of the partition coefficient in an n-octanol/water system. LogP must not differ much from zero for differential solubility to be high. DS = \|logP\|, where \|x\| is the absolute value of x. | Assumption that drug must be taken up (hydrophilicity) and diffuse across barriers (lipophilicity). |
Diffusibility97 | Inversely proportional to molecular weight. | Assumption that molecular size is inversely proportional to uptake by cells or diffusion across biological barriers. |
Table 2
Database | Marine Compounds | Link | Note |
---|---|---|---|
Seaweed Metabolite Database | 1,110 | https://www.swmd.co.in/ | Only from algae. |
MarinChem3D | 30,117 | http://mc3d.qnlm.ac/ | All marine compounds. |
Comprehensive Marine Natural Products Database | 32,000 | https://www.cmnpd.org/ | All marine compounds. |
Dictionary of Marine Natural Products | Not specified | http://dmnp.chemnetbase.com/faces/chemical/ChemicalSearch.xhtml | Paid access database. |
MarinLit | 35,790 | http://pubs.rsc.org/marinlit | Paid access database. |
MetaboLights | Not specified | https://www.ebi.ac.uk/metabolights/index | Generic Metabolites database, it is possible to select single species. |
We propose that this network be explored by a particle swarm algorithm (Figure 5B and C) where only local optima are recorded to avoid over-representing structural homologues and a tabu list is maintained to ensure that the same molecule is not re-tested.41 After a list of high scoring molecules is extracted from the organized library, molecular docking of homologues surrounding the optimum in the network can be used to extract a pharmacophore or to identify plant extracts for in-vitro validation.
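A heavily simplified sketch of this idea is given below. It is not a canonical particle swarm optimizer (there are no velocities or shared global best); instead, a small swarm of greedy walkers climbs a similarity network, records only local optima, and uses a cache as the tabu list so that no molecule is docked twice. The network, the molecule names and the docking scores are all hypothetical.

```python
import random
from collections import Counter

def swarm_graph_search(neighbours, dock, n_particles=3, n_steps=20, seed=0):
    rng = random.Random(seed)
    tested = {}  # tabu list: molecule -> cached docking score

    def score(node):
        if node not in tested:       # each molecule is docked at most once
            tested[node] = dock(node)
        return tested[node]

    nodes = sorted(neighbours)
    particles = rng.sample(nodes, n_particles)
    local_optima = set()
    for _ in range(n_steps):
        for i, node in enumerate(particles):
            best_nb = max(neighbours[node], key=score, default=node)
            if score(best_nb) > score(node):
                particles[i] = best_nb       # climb towards a better homologue
            else:
                local_optima.add(node)       # record only the local optimum
                untested = [n for n in nodes if n not in tested]
                if untested:                 # restart in an unexplored region
                    particles[i] = rng.choice(untested)
    return local_optima, tested

# Hypothetical similarity network (a simple chain) and docking scores.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("f", "g")]
neighbours = {}
for u, v in edges:
    neighbours.setdefault(u, []).append(v)
    neighbours.setdefault(v, []).append(u)

toy_scores = {"a": 1, "b": 3, "c": 5, "d": 2, "e": 4, "f": 6, "g": 1}
dock_calls = Counter()

def toy_dock(molecule):
    dock_calls[molecule] += 1
    return toy_scores[molecule]

optima, tested = swarm_graph_search(neighbours, toy_dock)
print(sorted(optima))  # ['c', 'f']
```

In this toy network the two peaks are reported once each, while their lower-scoring neighbours (the structural homologues) are docked but not reported.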
In-Vitro Validation
The main weakness of in-silico drug design is the unrealistic approximations that must be used to shrink the computational requirements of a given search. Often, quantum approximations and empirical scoring functions are used to simplify the calculation of binding affinities. Furthermore, receptor flexibility is generally not allowed in docking simulations due to the added search space entailed in moving both the ligand and the target.42 As a result, in-silico evaluation is best suited to a “hit to lead” approach, where a wide list of potential leads (termed “hits”) is assayed in vitro to narrow down the “true leads”.10,11,18,19,21,29,33,39,43
Selected natural extracts from in-silico analysis can then undergo high throughput screening for inhibitor activity. Techniques for finding PLPRO or MPRO inhibitors can utilize the proteolytic rate constant as an indicator of activity,44 for example, by using fluorescence polarization of fluorescently labelled peptides.45 Alternatively, a technique such as surface plasmon resonance can be used to directly measure dissociation constants (KD).11,33 Once an extract is confirmed as being inhibitory, chromatographic fractions can be tested in another round of high throughput screening to identify the molecular formula of the inhibitor.29
In 2021, this group proposed that in-silico drug discovery could be exploited to generate a series of hits that would be traced back to natural product extracts, and that subsequently these extracts would be assayed using a mass spectrometry technique that allows the detection of ligands via mass signatures. This type of high throughput workflow allows for the rapid creation of natural, antiviral drug leads and could effectively lead to a broad-spectrum prophylactic/therapeutic for future betacoronavirus pathogens.46
Machine Learning and Drug Discovery
The terms machine learning (ML), artificial intelligence (AI) and deep learning (DL) are often used interchangeably but have distinctly different meanings. AI is the field of computer science concerned with systems that simulate intelligent behavior using their environment as input. ML is a branch of AI based on the idea that systems can learn from data, identify patterns, recognize behaviors, and make decisions with minimal human intervention,47 while DL is the implementation of representation-learning methods with multiple levels of representation;48 it is therefore a subset of ML and is commonly implemented with neural networks (Figure 6). ML is not a new term; it was coined in 1959 by Arthur Samuel, an IBM engineer, in describing a checkers game.49 The capabilities of ML algorithms have matured in the past 20 years as computers and computational resources have gained in sophistication. Historically, the utility of ML algorithms has been limited by data availability and computational power. Since the 2000s, ML tools have become more accessible, with several open-source libraries providing powerful and accessible methods. These libraries have enabled faster and wider application and implementation of ML algorithms to various tasks. The recent uptake in ML has seen it applied to a variety of tasks, including stochastic prediction and data classification.50 ML models have proven useful on a myriad of data types and tasks.
A range of open-source ML libraries has been developed in various programming languages. The most commonly used language is Python, which enables the use of Scikit-learn,51 Keras,52 TensorFlow,42 Facebook AI Similarity Search (Faiss)53 and PyTorch.54 Their aptitude for pattern recognition has made these libraries useful in biomedical applications including neuroimaging, general pathology, and protein-folding.55–57 Just as there are many applications for ML, there are also many different ML models to choose from when performing a task. Types of ML can be broadly, though not strictly, categorised by their functionality, such as classification, prediction, or clustering, or by the degree of user input, such as supervised, semi-supervised and unsupervised.58,59
ML algorithms do not always provide the optimal solution to a task, as some data sets are simply not large enough to train an algorithm sufficiently. Sometimes it is not feasible to analyze all the possible patterns one by one; in these cases, instead of reaching an absolute deterministic solution, the ML algorithm will apply shortcuts to reduce the search space and provide a good enough solution, as in the case of heuristics.60 Choosing an algorithm (or algorithms) for a given application is also an exhaustive process, because no algorithm is better than the others in every application and the choice is highly dependent on the particular case. Once the algorithm is selected, it is crucial to select the model that best generalizes the data;60 one way to do so is by tuning the parameters. This refinement can introduce bias if not done carefully.
In addition to these selection and refinement steps, the hyperparameters of the algorithms can further introduce bias. The field of automated ML (auto ML) offers a potential solution to these problems. It utilizes ML techniques to automate data selection and preparation, model selection and hyperparameter optimisation, potentially reducing bias and allowing effective comparisons exploiting many algorithms and configurations.61 ML algorithms can result in overfitting, where the models produced fit too closely to the training data and are not representative of reality. ML practitioners implement techniques such as the holdout method and cross validation to reduce the risk of overfitting.62–64 ML models are further limited by the available data, the quality of data pre-processing, selection, or reduction, the hyperparameters chosen and time constraints.
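The holdout method and k-fold cross validation mentioned above can be sketched in plain Python. This is a minimal illustration under simple assumptions: the helper names are hypothetical, the model is a one-variable least-squares line, and the data are synthetic and noise-free. In practice a library such as Scikit-learn would normally be used.

```python
import random

def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the sample indices and set aside a fraction for testing."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(round(n_samples * test_fraction)))
    return idx[n_test:], idx[:n_test]

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]

def cross_val_mse(xs, ys, fit, predict, k=5):
    """Mean squared error averaged over the k held-out folds."""
    errors = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_mse = sum((predict(model, xs[i]) - ys[i]) ** 2 for i in test) / len(test)
        errors.append(fold_mse)
    return sum(errors) / len(errors)

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept

xs = list(range(20))
ys = [2 * x + 1 for x in xs]           # synthetic, noise-free data
cv_error = cross_val_mse(xs, ys, fit_line, predict_line)
print(cv_error)
```

A model that scores much better on its training folds than on the held-out folds is a typical sign of overfitting.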
A common limitation when implementing ML, especially when considering biological data, is its high dimensionality combined with a low number of observations. Too many features can exponentially increase the computational strain involved in training, testing, and implementation. An excess of features can also reduce the effectiveness of an algorithm by introducing noise. To reduce the number of features, feature selection and feature extraction techniques can be implemented. Feature selection aims to extract the most relevant features from a dataset.65 This can be manual or automatic, with many of the automatic feature selection methods relying on statistical methods. Feature extraction aims to reduce the number of features by using the existing features to create new ones, lower in number and more representative, and discarding the old ones. Many feature extraction methods are in themselves unsupervised ML algorithms; examples include principal component analysis (PCA), Uniform Manifold Approximation and Projection (UMAP) and independent component analysis (ICA) (18).
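As a small illustration of automatic, statistics-based feature selection, the sketch below ranks descriptors by the absolute Pearson correlation with a measured activity and keeps the top few. The descriptor names and all values are invented for illustration, and correlation filtering is only one of many possible selection criteria.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def select_features(table, target, top_k=2):
    """Keep the top_k features most strongly correlated with the target."""
    ranked = sorted(table, key=lambda name: abs(pearson(table[name], target)),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical descriptor table: one list of values per feature.
descriptors = {
    "mol_weight": [180, 250, 310, 420, 500],
    "logp": [1.2, 0.4, 2.9, 1.1, 2.2],
    "h_bond_donors": [1, 2, 3, 4, 5],
    "constant_flag": [1, 1, 1, 1, 1],
}
activity = [0.9, 2.1, 2.9, 4.2, 5.0]   # hypothetical measured activity
selected = select_features(descriptors, activity)
print(selected)
```

Here the uninformative constant column and the weakly correlated descriptor are dropped, leaving two features for model training.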
In supervised learning, data is divided into training and test datasets. The performance of algorithms is assessed using the test data, which can inform changes to the hyperparameters and features. Testing data cannot be used to improve the model performance.50 In the drug discovery field, ML algorithms enable searches of large databases for potential therapeutic compounds of interest. The results of these algorithms are then used to inform further studies, streamlining searches for potential drugs. In drug discovery, there are various possible features, which are often categorised by the dimension they describe. For example, 0D descriptors include molecular weight and counts of specific atom types such as heavy atoms; 1D descriptors describe the 1-dimensional features of a molecule such as acetyl or hydroxyl functional groups; 2D descriptors include topological features such as polarity number and Wiener index; 3D descriptors include geometrical molecular descriptors and steric properties.66 Other descriptors include statistics derived from the simplified molecular-input line-entry system (SMILES) representation of a molecule.67–69
One of the first tasks when developing an ML algorithm for drug discovery is determining which combination of descriptors works most effectively with given datasets. These data are available for both ligands and the receptors they bind to, and are used to train the ML algorithm(s) to find any underlying patterns in which molecules are likely to bind to each other, and then to test the efficacy of the trained algorithm using methods such as cross validation. ML algorithms have found applications alongside classical approaches from the early stages involving screening of compound libraries, to the later phases including clinical trials, enabling improvements in the accuracy and speed of the drug discovery process.
One key factor used to identify suitable candidates is the estimation of pharmacokinetic properties, ie, absorption, distribution, metabolism, excretion, and toxicity (ADMET). ADMET provides important insights into the behaviour of the compound in the living organism, including bioavailability and toxicity.70 Owing to the high complexity of biological organisms and chemical reactions, producing a mathematical model capable of correctly predicting these properties is a very challenging problem. Quantitative structure–activity relationship (QSAR) models, which try to relate chemical data to biological properties and propose mechanisms of interaction between compounds and the target protein, have been widely used to estimate ADMET properties. Both QSAR modelling and ADMET estimation, usually based on molecular descriptors (including fingerprints, 2D and 3D descriptors and steric parameters), have benefited from the introduction of ML approaches. Supervised ML methods predict single or multiple properties of the selected compounds from independent variables (single and multitask models) to reduce the number of experimental failures. Unsupervised ML models using unlabeled data can be used to identify new scaffolds (scaffold-hopping) or compounds with query-like properties.70,71 Several ML algorithms are in use in QSAR and ADMET estimations, including k-nearest neighbour, random forest (RF), support vector machines (SVM), principal component analysis and deep networks, both to predict properties and to perform similarity analysis to obtain compounds with similar features from large datasets.67,70
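Of the algorithms listed, k-nearest neighbour is the simplest to sketch: the property of a query compound is estimated as the mean property of its k closest neighbours in descriptor space. The descriptor vectors and property values below are hypothetical, and the descriptors are assumed to be already scaled to comparable ranges.

```python
import math

def knn_predict(query, training, k=3):
    """Predict a property as the mean over the k nearest training compounds.
    `training` is a list of (descriptor_vector, property_value) pairs."""
    by_distance = sorted(training, key=lambda item: math.dist(query, item[0]))
    nearest = by_distance[:k]
    return sum(value for _, value in nearest) / len(nearest)

# Hypothetical training set: (scaled molecular weight, logP) -> measured property.
training = [
    ((1.0, 0.5), 2.0), ((1.2, 0.7), 2.2), ((1.1, 0.4), 1.8),
    ((4.0, 3.0), 6.0), ((4.2, 3.1), 6.4), ((3.9, 2.8), 5.9),
]
small_like = knn_predict((1.1, 0.5), training)
large_like = knn_predict((4.0, 3.0), training)
print(small_like, large_like)
```

The same neighbour search, without the averaging step, can serve as a similarity analysis to retrieve compounds with features close to a query.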
Multitask Deep Neural Networks (DNNs) are used to infer small-molecule properties and activities and to make predictions about the readout of a molecule in a new experimental setup.72,73 These methods significantly improved on the classical statistical methods used before, because they are faster and can efficiently predict relevant pharmacological parameters from large datasets, reducing the number of compounds to be tested in virtual screening (VS). The scoring functions in molecular docking programs must be able to position the ligands in the best available pose (docking power). Subsequently, the scoring function must correctly estimate the binding affinities based on the poses obtained (scoring power). Furthermore, it must classify the different compounds as good or bad ligands according to the obtained poses (ranking power).71 The classical scoring functions are force-field-based, empirical, or knowledge-based. Here, ML approaches are used to improve the accuracy of binding energy predictions in various ways: SVM and RF trained on ligand-protein complexes described as geometrical features or chemical descriptors, RF-score based on features deriving from different docking programs, and SVM to predict the IC50 of protein inhibitors.71 Also, given a set of ligands bound to a target protein, the ranking power should correctly find the top hits among ligand poses. Here, for each experimentally obtained protein-ligand complex and its binding affinity, the features from different scoring functions were retrieved, and the resulting set of ligand-protein features was used to train 6 ML models able to rank the complexes and predict the highest, median and lowest binding affinity.71 A non-parametric ML approach was also proposed to build target-specific scoring functions; this method is considered useful in lead optimization and prediction of the best novel ligands. The docking power identifies the best binding pose of a ligand given a set of poses.
The most stable complexes reside at energy minima, and different methods are used to search the conformational space for these configurations: steepest descent optimization, simulated annealing, stochastic approaches (Monte Carlo) and molecular dynamics. In this field, SVM and RF models that are insensitive to docking pose accuracy have been proposed.71 Further improvements in ML approaches to docking power will improve the effectiveness and accuracy of docking simulations. Finally, screening power is the ability to identify molecules that can effectively bind the target among random molecules. Here, scaffold-hopping techniques are used to identify novel ligands based on similar chemical positioning. Among ML-aided VS approaches, one interesting procedure built a discriminating SVM model, based on extended-connectivity fingerprints (ECFPs) and physical properties, to distinguish putative inhibitors of c-Met from other compounds in large libraries; virtual screening carried out only on the putative inhibitors showed better performance and more accurate compound selection for downstream analysis.74
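The "powers" described above are usually assessed with simple statistics. The sketch below is a hedged illustration rather than a method taken from the cited work: it measures rank agreement between predicted scores and measured affinities with the Spearman correlation (one common choice for ranking power), and measures screening power with an enrichment factor, ie, how over-represented known actives are in the top-scored fraction of a library. All scores, affinities and activity labels are invented, and lower (more negative) scores are assumed to be better.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = average
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def enrichment_factor(scores: Sequence[float],
                      is_active: Sequence[bool],
                      top_fraction: float = 0.1) -> float:
    """Fraction of actives in the best-scored subset divided by the overall fraction."""
    n = len(scores)
    total_actives = sum(is_active)
    if total_actives == 0:
        raise ValueError("at least one active compound is required")
    n_top = max(1, round(n * top_fraction))
    best_first = sorted(range(n), key=lambda i: scores[i])  # lower score = better
    hits = sum(is_active[i] for i in best_first[:n_top])
    return (hits / n_top) / (total_actives / n)


# Hypothetical data: predicted scores vs measured binding energies (kcal/mol).
predicted = [-9.1, -7.4, -8.2, -5.0, -6.3]
measured = [-10.2, -8.9, -8.0, -6.1, -7.5]
rho = spearman(predicted, measured)

# Hypothetical library of 10 compounds, 3 of which are known actives.
library_scores = [-9.5, -9.0, -8.5, -8.0, -7.5, -7.0, -6.5, -6.0, -5.5, -5.0]
library_active = [True, True, False, False, False, False, False, False, False, True]
ef_20 = enrichment_factor(library_scores, library_active, top_fraction=0.2)
print(round(rho, 3), round(ef_20, 3))
```

An enrichment factor above 1 means the scoring function places actives in the top fraction more often than random selection would.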
ML has also been applied to compound synthesis after dereplication. Reaction condition recommendation remains an indispensable aspect for achieving computer-assisted synthesis. Accurate reaction conditions are needed for experimental validation and exert a significant effect on the success or failure of an attempted transformation. De novo condition selection has traditionally relied on chemists’ background knowledge and expertise. Neural-network models are being exploited to predict the chemical context (ie, catalyst(s), solvent(s), reagent(s)), in addition to the optimal temperature for any given organic reaction.75,76 Pairing chemistry and ML, via data-driven analyses, neural network predictions and monitoring of chemical systems, is adding to (i) our understanding of the complexity of chemical data and (ii) experimental design and streamlining.76
Integration of ML in Screening of Marine Natural Products
Examining marine compounds active against SARS-CoV-2 proteins through virtual screening with integrated ML presents a novel and viable approach. First, we need to consider the available databases containing marine compounds to screen. Genomic databases could be considered in the future if a family of related compounds is identified. An example of how this might be achieved for natural products, ie, identifying biosynthetic signatures in genomic data, predicting what structures will be created from those genomic signatures, and anticipating the types of activity one might expect from those molecules, is reviewed by Prihoda et al.77
Today, available resources for marine compounds are limited to five databases of marine natural compounds, either commercial or free of charge. The two commercial databases are MarinLit78 and the Dictionary of Marine Natural Products.79 These are the most comprehensive marine natural compound databases available, containing data from publications, synthesis, organisms, and biological activities.80 According to MarinLit, the database contains >35,000 articles. The open databases available are highlighted in Table 2.
The Seaweed Metabolite Database (SWMD) contains 1110 compounds from brown, green and red algae.81
MarinChem3D contains more than 30,000 compounds with defined 3D structures and molecular descriptors.82
The Comprehensive Marine Natural Products Database (CMNPD) contains more than 32,000 marine compounds with various physicochemical and pharmacokinetic properties and biological activity data.38
In addition to these databases, MetaboLights,83 a repository for metabolomics experiments, contains compounds from many sources and allows searches to be restricted to the species of interest. Structural databases including ChEMBL, ZINC,84 PubChem,84 DrugBank85 and ChemSpider86 cover many molecules, including already patented drugs, natural compounds and bioactive compounds, together with related properties. In addition, MoleculeNet87 is a benchmark designed for testing ML methods on molecular properties. It contains features for >700,000 molecules. MoleculeNet is distributed with the Python library DeepChem,87 which contains several ML and DL algorithms as well as tools to compute descriptors and preprocess data. The direct connection of DeepChem to the MoleculeNet datasets, which cover toxicology, solubility and biological activities, makes DeepChem an important tool in ML-aided drug design. Retrieving data from one of the two largest free databases (MarinChem3D or CMNPD) would provide a dataset of >30,000 compounds, which could be extended with known inhibitors of the SARS-CoV-2 target proteins (eg, 159 for 3CLpro according to ChEMBL, ID: CHEMBL3927) (Figure 7).
Integrating supervised and unsupervised ML approaches offers an advantage in reducing both virtual screening times and the number of false hits caused by compounds with poor pharmacokinetic properties or toxic effects. Screens can focus on those compounds that demonstrate structural or pharmacokinetic similarities with already known drugs. To focus research on new molecular scaffolds with suitable pharmacophoric properties, a multitask DNN could be used to learn a model that predicts multiple pharmacokinetic properties (including absorption, distribution, metabolism, excretion, and toxicity) from physicochemical properties.70,72
Predicting important pharmacokinetic properties will reduce the number of molecules to be screened and focus the analysis on compounds with optimal pharmacophoric characteristics, reducing the risk of finding high-affinity but unsuitable compounds. Furthermore, a model trained on physicochemical properties allows a search that is unaffected by pre-existing structural patterns and scaffolds. On the other hand, including already known inhibitors and descriptors (pharmacophore and structural fingerprints) allows unsupervised approaches to group the dataset according to the similarity between the compounds and the inhibitors.69 Fuzzy clustering produces overlapping clusters, so a molecule can belong to several clusters at once, which better represents the similarity relationships in the dataset. Selecting the clusters that contain the inhibitors focuses the virtual screening on molecules that share similarities with them and, probably, similar binding affinities.
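The fuzzy clustering step can be sketched as follows. This is a minimal fuzzy c-means implementation under stated assumptions, not the procedure of any cited study: compounds are hypothetical points in a two-dimensional descriptor space, the fuzzifier is fixed at m = 2, the initial cluster centres are supplied by hand, and the index of the "known inhibitor" and the membership threshold are invented for illustration.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Sequence[float]


def memberships(points: Sequence[Point], centers: Sequence[Point],
                m: float = 2.0) -> List[List[float]]:
    """Degree of membership of every point in every cluster (each row sums to 1)."""
    exponent = 2.0 / (m - 1.0)
    table = []
    for p in points:
        d = [dist(p, c) for c in centers]
        if 0.0 in d:
            # The point sits exactly on one or more centres: split membership there.
            hits = [1.0 if x == 0.0 else 0.0 for x in d]
            row = [h / sum(hits) for h in hits]
        else:
            row = [1.0 / sum((di / dk) ** exponent for dk in d) for di in d]
        table.append(row)
    return table


def update_centers(points: Sequence[Point], u: List[List[float]],
                   m: float = 2.0) -> List[List[float]]:
    """Each centre is the mean of all points weighted by membership ** m."""
    centers = []
    for j in range(len(u[0])):
        weights = [row[j] ** m for row in u]
        total = sum(weights)
        centers.append([sum(w * p[axis] for w, p in zip(weights, points)) / total
                        for axis in range(len(points[0]))])
    return centers


def fuzzy_c_means(points: Sequence[Point], initial_centers: Sequence[Point],
                  m: float = 2.0, iterations: int = 50
                  ) -> Tuple[List[List[float]], List[List[float]]]:
    centers = [list(c) for c in initial_centers]
    for _ in range(iterations):
        centers = update_centers(points, memberships(points, centers, m), m)
    return centers, memberships(points, centers, m)


# Hypothetical descriptor vectors: two tight groups and one in-between compound.
COMPOUNDS = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
             (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
             (2.5, 2.6)]
KNOWN_INHIBITOR = 0  # assumed index of a known inhibitor in COMPOUNDS

centers, u = fuzzy_c_means(COMPOUNDS, initial_centers=[(0.0, 0.0), (5.0, 5.0)])
inhibitor_cluster = max(range(len(centers)), key=lambda j: u[KNOWN_INHIBITOR][j])
selected = [i for i, row in enumerate(u) if row[inhibitor_cluster] >= 0.3]
print(selected)
```

Because memberships overlap, the in-between compound is retained for screening alongside the inhibitor's close neighbours, which a hard clustering with a single label per compound could have discarded.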
Datasets of protein-ligand complexes, obtained from the PDB (https://www.rcsb.org/), can be used to train ML models to better estimate ligand–protein interactions, binding energies and rankings, improving the performance and accuracy of molecular docking analysis.71 Once the top hits have been obtained from the molecular docking process, the ligand-protein complexes should be validated using MD simulations. These analyses use force fields to approximate the intra- and inter-molecular interactions among solvent, protein, and ligand. Despite its high computational cost, MD makes it possible to study numerous factors such as optimization of the poses, stability of the complexes, interactions, and binding energies. Because of this cost, ML approaches have been proposed to replace or complement force-field-based MD simulations. Starting from a dataset of well-defined ligands and protein sequences, supplemented with ligand fingerprints, some ML models have been trained to predict the binding pose and energy of ligand-protein complexes.88–90
These models are designed to work in the very initial stages of the drug-design process, and their robustness depends heavily on the variety and quality of the training datasets. Similar ML approaches can be used to reduce the number of compounds to be screened and eventually, depending on the robustness of the model, to skip docking and MD simulations altogether. Several ML approaches have also been proposed to enhance the speed and accuracy of force fields and scoring functions, cutting the time required for MD simulations.91–93 These ML models have been trained on 3D properties (X-ray, ab initio MD) and on features derived from existing scoring functions. Finally, to reduce the number of MD runs needed to retrieve the best pose, a probabilistic ML model based on Best Arm Identification (BAI) has been proposed.94 This approach does not replace or enhance the MD analyses but selects the most promising poses, reducing the number of runs and therefore the computational cost where resources are limited. Figure 8 summarizes the options for integrating ML approaches into the drug-design workflow. The choice of the optimal route depends on the available data (number of compounds to be screened, known inhibitors, available features), the characteristics of the target proteins (similarity to other known viral proteins, distinctiveness of the active site) and, finally, the computational resources.
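To make the BAI idea concrete, the sketch below uses successive halving, a generic best-arm identification strategy chosen here for brevity; it is not the specific probabilistic model of the cited work. Each candidate pose is treated as an "arm", and each "pull" stands in for one short, noisy MD-based energy estimate. The true mean energies, the noise level and the function names are all hypothetical.

```python
import random
from typing import Callable, Tuple


def successive_halving(n_arms: int,
                       pull: Callable[[int], float],
                       pulls_per_round: int = 5) -> Tuple[int, int]:
    """Return (index of the arm with the lowest estimated mean, total pulls used)."""
    survivors = list(range(n_arms))
    totals = [0.0] * n_arms
    counts = [0] * n_arms
    while len(survivors) > 1:
        # Spend the same number of noisy evaluations on every surviving pose.
        for arm in survivors:
            for _ in range(pulls_per_round):
                totals[arm] += pull(arm)
                counts[arm] += 1
        # Keep the better half (lowest running mean energy), rounding up.
        survivors.sort(key=lambda a: totals[a] / counts[a])
        survivors = survivors[:(len(survivors) + 1) // 2]
    return survivors[0], sum(counts)


# Hypothetical true mean binding energies (kcal/mol) for four candidate poses.
TRUE_MEANS = [-6.0, -7.0, -9.0, -6.5]
rng = random.Random(0)


def noisy_md_estimate(pose: int) -> float:
    """Stand-in for one short MD run: the true mean plus Gaussian noise."""
    return rng.gauss(TRUE_MEANS[pose], 0.5)


best_pose, runs_used = successive_halving(len(TRUE_MEANS), noisy_md_estimate)
print(best_pose, runs_used)
```

With four poses and five runs per round, this schedule uses 30 runs in total, whereas giving every pose the ten runs that the two finalists receive would take 40. The saving grows with the number of candidate poses, which is the point made above about limited computational resources.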
Information captured for different natural products, including spectra (eg, MS, NMR), binding affinities with protease targets, the presence or absence of component molecules and their respective antiviral activity, provides potentially relevant features for discovering natural products with a high probability of SARS-CoV-2 inhibition. Considered simultaneously, such features may provide insight into the candidates with the greatest likelihood of successful inhibition. Furthermore, identifying key features or combinations of features likely to have strong inhibitory effects can guide the screening of newly discovered natural products.
A recent success was a study aiming to discover potential inhibitors of transmembrane protease, serine 2 (TMPRSS2) by virtual screening of a library of marine natural products (MNPs) against a homology model of TMPRSS2.95 Molecular docking, binding affinity analysis using MM-GBSA and ADME evaluations were carried out to explore the inhibitory activity of MNPs against TMPRSS2 and determine their pharmacokinetic properties. Seven MNPs were predicted to inhibit TMPRSS2 and one in particular, MNP-10 or Watasenia β-D-Preluciferyl glucopyrasoiuronic acid, was the optimal inhibitor of TMPRSS2 with acceptable pharmacokinetic properties. This MNP holds promise as a novel TMPRSS2 blocker to combat SARS-CoV-2.
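For readers unfamiliar with MM-GBSA, the quantity it estimates is usually written in the general form below. This is the standard textbook decomposition, given here as background and not as the exact protocol of the cited study: G denotes the free energy of each species, E_MM the molecular mechanics energy, G_GB and G_SA the polar (generalized Born) and non-polar (surface area) solvation terms, and TS the entropic contribution, which is often omitted when the method is used only to rank compounds.

```latex
\Delta G_{\mathrm{bind}} = G_{\mathrm{complex}} - \left( G_{\mathrm{protein}} + G_{\mathrm{ligand}} \right),
\qquad
G = E_{\mathrm{MM}} + G_{\mathrm{GB}} + G_{\mathrm{SA}} - T S
```

A more negative value of the binding free energy indicates a more favourable predicted interaction between the ligand and the protein.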
Conclusion
Viral structural proteins are most vulnerable to the host immune defense and are therefore subject to stronger selection pressure and mutate more frequently than intracellular viral proteins. From the perspective of therapeutic intervention, the better conservation of intracellular viral proteins makes them better targets for broad-spectrum therapies. Molecular docking predicts the interaction of a ligand with a drug target. The computational power afforded by molecular docking, ML and databases of marine natural products will help the identification of broad-spectrum inhibitors of SARS-CoV-2 and other betacoronaviruses. ML algorithms facilitate searches of large databases for potential therapeutic compounds of interest; the results subsequently inform further studies and streamline searches for potential drugs. The natural world is likely to be a source of useful small-molecule antivirals against RNA viruses such as coronaviruses, which use papain-like proteases and RdRp for replication. In-vitro-only high-throughput screening of natural product extracts is an inefficient and expensive strategy. In-silico augmentation of drug discovery is at the cutting edge of SARS-CoV research and pharmaceutical research in general. ML algorithms have found applications alongside traditional approaches from the initial stages, involving compound library screening, to the later phases, including clinical trials, enabling improvements in the accuracy and speed of the drug discovery process.
Disclosure
BJW, MH and GH report grants from the National Institutes of Health during the conduct of the study. GH acknowledges support from National Institutes of Health (NIH) grants U54MD010706 and U01DA045300 and from QUB start-up funds. GH is a founder of Altomics Datamation Ltd. and a member of its scientific advisory board. The authors report no other conflicts of interest in this work.
References
Articles from Infection and Drug Resistance are provided here courtesy of Dove Press
Full text links
Read article at publisher's site: https://doi.org/10.2147/idr.s395203
Read article for free, from open access legal sources, via Unpaywall: https://www.dovepress.com/getfile.php?fileID=89082
Funding
Funders who supported this work.
NCCIH NIH HHS (1)
Grant ID: R01 AT007318
NIGMS NIH HHS (1)
Grant ID: R01 GM145845