Published: 19 December 2008

targetTB: A target identification pipeline for Mycobacterium tuberculosis through an interactome, reactome and genome-scale structural analysis

Karthik Raman¹,
Kalidas Yeturu¹ &
Nagasuma Chandra¹

BMC Systems Biology volume 2, Article number: 109 (2008)

Abstract

Background

Tuberculosis remains one of the leading infectious killers, warranting the identification of newer targets and drugs. Identifying and validating appropriate targets for drug design are critical steps in drug discovery and are at present major bottlenecks. A majority of drugs in current clinical use for many diseases have been designed without knowledge of their targets, perhaps because standard methodologies for identifying such targets in a high-throughput fashion do not yet exist. With the different kinds of 'omics' data now available, computational approaches can be a powerful means of obtaining short-lists of possible targets for further experimental validation.

Results

We report a comprehensive in silico target identification pipeline, targetTB, for Mycobacterium tuberculosis. The pipeline incorporates a network analysis of the protein-protein interactome, a flux balance analysis of the reactome, experimentally derived phenotype essentiality data, sequence analyses and a structural assessment of targetability, using novel algorithms recently developed by us. Using flux balance analysis and network analysis, proteins critical for survival of M. tuberculosis are first identified, followed by comparative genomics with the host, finally incorporating a novel structural analysis of the binding sites to assess the feasibility of a protein as a target. Further analyses include correlation with expression data and non-similarity to gut flora proteins as well as 'anti-targets' in the host, leading to the identification of 451 high-confidence targets. Through phylogenetic profiling against 228 pathogen genomes, shortlisted targets have been further explored to identify broad-spectrum antibiotic targets, while also identifying those specific to tuberculosis. Targets that address mycobacterial persistence and drug resistance mechanisms are also analysed.

Conclusion

The pipeline provides a rational schema for identifying drug targets that are likely to have high rates of success, which is expected to save enormous amounts of money, resources and time in the drug discovery process. A thorough comparison with previously suggested targets in the literature demonstrates the usefulness of the integrated approach used in our study, highlighting the importance of systems-level analyses in particular. The method has the potential to be used as a general strategy for target identification and validation and hence to significantly impact most drug discovery programmes.

Background

It is estimated that about two billion people, equalling one-third of the world's total population, are infected with M. tuberculosis (Mtb) [1]. In 2006 alone, 1.7 million people died of tuberculosis (TB). TB is also the leading killer among HIV-infected people with weakened immune systems. The disease is also of particular interest to India and Asia, with more than half of all deaths occurring in Asia. Further, about 500,000 new multi-drug resistant TB cases are estimated to occur every year [1].

Currently, over 20 drugs are available for TB, of which four, viz. isoniazid, rifampin, pyrazinamide and ethambutol, are used as front-line drugs. Injectable drugs such as kanamycin, amikacin, capreomycin and viomycin are preferred next for treatment. Fluoroquinolones such as ciprofloxacin and ofloxacin have been found to be indispensable in the treatment of multi-drug resistant TB. Second-line bacteriostatics, such as p-aminosalicylic acid, ethionamide and cycloserine, have established clinical efficacy but have more prominent side effects [2]. Isoniazid and ethionamide are inhibitors of mycolic acid synthesis [3, 4], while cycloserine and ethambutol inhibit synthesis of peptidoglycan [5] and cell wall arabinogalactan [6, 7] respectively, weakening the cell wall of the bacterium. Rifampin and amikacin exert their pharmacological action by inhibiting bacterial RNA or protein synthesis [8–10]. As in the case of most other prescription drugs used currently, these were also discovered without the advantage of detailed molecular-level information about the targets. A common strategy used in the past few decades for drug discovery involves finer structural optimisations, starting with a lead compound that has already shown some success. Very often, this amounts to finding a newer, improved drug that modifies the function of the same target as the lead compound. This does not automatically lead to consideration of newer targets or even newer mechanisms of action. It is no surprise, therefore, that only a small fraction of the proteins in the bacterial genome have been explored as drug targets.

The existing drugs, although of immense value in controlling the disease to the extent that is being done today, have several shortcomings, the most important being the emergence of drug resistance, which renders even the front-line drugs inactive. In addition, drugs such as rifampin have high levels of adverse effects, making patient non-compliance likely. Another important problem with most of the existing anti-mycobacterials is their inability to act upon latent forms of the bacillus. In addition to these problems, the vicious interactions between the human immunodeficiency virus and TB have led to further challenges for anti-tubercular drug discovery [11]. For example, protease inhibitors have been shown to be incompatible with rifampin-containing anti-TB regimens [12]. As drug discovery efforts are increasingly becoming rational and much less dependent on trial and error, identification of appropriate targets becomes a fundamental pre-requisite.

Traditionally, targets have been identified through knowledge of the function of individual protein molecules, where their function has been well-characterised. Potential targets thus identified are generally taken through a validation process involving whole-cell or animal experiments, gene knock-outs or site-directed mutagenesis that lead to loss-of-function phenotypes. Target validation is one of the critical steps in drug discovery, where a lot of time and money is spent in the pharmaceutical industry. The need for systematic and large-scale validation in the post-genomic era has led to the usage of computational methods for validation [13]. Here, we seek to apply various in silico techniques for the identification and validation of drug targets, specifically for Mtb. In silico methods have the advantage of speed, low cost and even more importantly, provide a systems view of the whole microbe at a time, which enables asking questions that are often difficult to address experimentally. Drug discovery has witnessed a paradigm shift from the traditional medicinal chemistry-based ligand-oriented drug discovery approaches to rational drug target identification and target-driven lead discovery, by targeting the molecular mechanisms of disease. A number of studies have been carried out by various experimental methods to identify drug targets in Mtb [14]. Attempts have also been made for the same purpose, based on sequence comparisons of metabolic enzymes [15], and by using various features such as Lipinski druggability at the sequence level and metabolic choke-points at the systems-level [16].

Establishing systems biology concepts and understanding the microbe as a whole opens up new opportunities for computational target identification. Here, we report a comprehensive in silico target identification pipeline for Mtb, which can also be used as a general framework for in silico target identification. We focus our analysis at the systems level, based on network analyses and flux balance analyses (FBA), and further validate it through sequence analyses and structural comparisons. We have used novel algorithms for comparing protein structures and for identifying similarity of target pockets with pockets in the human proteome, which could initiate adverse drug effects. Gene expression data have also been considered to render the analysis more comprehensive.

Methods

The targetTB Pipeline

A new multi-level target identification pipeline, including a novel method for structural comparison of proteins, has been developed. Different levels of abstraction are used for analysis, as discussed below. A summary of the several datasets used in these analyses is given in Table 1.

Table 1 Datasets used in this study

Systems Analysis

Interactome Analysis

System Construction

We have constructed a protein-protein interaction network, based on the STRING database [17] version 7, which includes protein linkages between 3,925 Mtb proteins, inferred from published literature describing experimentally studied interactions, as well as those from genome analysis using several well-established methods such as domain fusion, phylogenetic profiling and gene neighbourhood concepts [18]. Thus, the network captures different types of interactions such as (a) physical complex formation between two proteins required to form a functional unit, (b) genes belonging to a single operon or to a common neighbourhood, (c) proteins in a given metabolic pathway and hence influenced by each other, (d) proteins whose associations are suggested based on predominant co-existence, co-expression, or domain fusion. Only the high-confidence interactions that had a STRING score of 0.7 or more were included in the network. We further augmented these with links between proteins that are influenced by the same metabolite, based on the reactions in the genome-scale metabolic reconstruction of Mtb, iNJ 661 [19]. The resulting network contained 3,405 of the 3,925 proteins.
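
The confidence filter described above can be illustrated with a minimal sketch. It assumes the interactions are already available as (protein, protein, score) triples with scores on a 0 to 1 scale; the function name `build_network` and the toy identifiers and scores are hypothetical and are not taken from STRING. If the scores were supplied on a 0 to 1000 scale, they would first need to be divided by 1000.

```python
from collections import defaultdict

def build_network(scored_links, threshold=0.7):
    """Return an undirected adjacency map containing only links
    whose confidence score is at least `threshold`."""
    adjacency = defaultdict(set)
    for protein_a, protein_b, score in scored_links:
        # Skip self-links and anything below the confidence cut-off.
        if protein_a != protein_b and score >= threshold:
            adjacency[protein_a].add(protein_b)
            adjacency[protein_b].add(protein_a)
    return dict(adjacency)

# Hypothetical toy input: identifiers and scores are illustrative only.
links = [
    ("Rv0001", "Rv0002", 0.92),
    ("Rv0002", "Rv0003", 0.65),  # below 0.7, so it is dropped
    ("Rv0001", "Rv0003", 0.70),  # exactly 0.7, so it is kept
]
network = build_network(links)
```

Proteins with no link at or above the threshold do not appear in the resulting map, which is one plausible reason for a network to contain fewer proteins than the database lists.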

Node Deletions

Networks may be perturbed through the removal of nodes and edges. A typical analysis would be to probe the effect of disrupting a node and its corresponding edges. Networks of different topologies vary in their resilience to various types of perturbations. The effect of node deletions on this network was analysed. Each of the 3,405 nodes was knocked out and critical network parameters such as clustering coefficient and characteristic path length were monitored. In addition, the number of shortest paths disrupted by each deletion was monitored. The shortest paths between all pairs of proteins in the network were computed. Following removal of a node, some of these shortest paths may be disrupted, leading to pairs of nodes becoming unreachable from one another. Based on the change (loss) in the connectivity of nodes in this network and the change in network structure on the deletion of nodes, we have delineated potential targets.
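
One way to quantify the loss of connectivity on deleting a node is to count the pairs of remaining nodes that become unreachable from one another. The sketch below does this with breadth-first search on a toy graph. It is an illustration under that assumption, not the implementation used in this study; the function names are hypothetical, and clustering coefficient and characteristic path length are not computed here.

```python
from collections import deque

def component(adjacency, start, removed=None):
    """Nodes reachable from `start` when `removed` is ignored."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

def reachable_pairs(adjacency, removed=None):
    """Number of unordered node pairs joined by at least one path."""
    total = 0
    for node in adjacency:
        if node != removed:
            total += len(component(adjacency, node, removed)) - 1
    return total // 2  # each pair was counted from both ends

def pairs_lost(adjacency, node):
    """Pairs of OTHER nodes that become unreachable when `node` is deleted."""
    before = reachable_pairs(adjacency)
    involving_node = len(component(adjacency, node)) - 1  # lost trivially
    after = reachable_pairs(adjacency, removed=node)
    return before - involving_node - after

# Toy chain A - B - C - D: deleting B cuts A off from C and D.
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
impact = {node: pairs_lost(chain, node) for node in chain}
```

Repeating a full traversal for every deletion is quadratic in the number of nodes, which is acceptable for a sketch but would need a more careful implementation for a network of a few thousand proteins.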

Reactome Analysis

Two independent genome-scale metabolic models for Mtb have become available. A genome-scale metabolic network, comprising 849 reactions, mediated by 739 metabolites and involving 726 genes, reported by McFadden and co-workers (GSMN-TB) [20] has been considered. Jamshidi and Palsson have reported another genome-scale metabolic model of Mtb, iNJ 661, comprising 939 reactions mediated by 828 metabolites and 661 genes [19]. We have also earlier published a pathway-level model (MAP) of mycolic acid biosynthesis in Mtb [21]. We collated a list of lethal gene deletions for these studies. The essentiality predictions for the iNJ 661 model were based on growth in Middlebrook 7H9 medium, as detailed in [19], while those for the GSMN-TB model were based on growth in Middlebrook 7H10 medium, as detailed in [20]. Genes whose deletion severely impaired growth (biomass formation) in the medium were designated as essential. The essentiality in [21] was studied using an objective function for optimal production of mycolates; a gene was considered essential if on deletion, most fluxes in the mycolic acid pathway including those of the mycolates dropped to zero. Using the COBRA Toolbox [22] for MATLAB, we also performed double gene deletions for the iNJ 661 model.
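
For reference, the textbook form of the flux balance problem underlying such gene deletion studies can be written as below. The notation is introduced here for illustration and is not taken from [19, 20]: S is the stoichiometric matrix, v the vector of reaction fluxes, c the objective vector (for example, selecting the biomass reaction), and the threshold τ is a hypothetical parameter, since the growth cut-off used to call a deletion lethal is model-specific.

```latex
\begin{aligned}
\max_{v}\quad & Z = c^{\mathsf{T}} v \\
\text{subject to}\quad & S\,v = 0, \\
& v_{\min} \le v \le v_{\max}.
\end{aligned}

% Deletion of gene g: every reaction j that cannot proceed without g
% is constrained to carry no flux, and the problem is solved again.
v_{\min,j} = v_{\max,j} = 0 \quad \text{for all } j \in R(g),
\qquad
g \text{ is called essential if } Z_{\Delta g} \le \tau \, Z_{\mathrm{wt}}.
```

Here R(g) denotes the set of reactions disabled by removing g, Z_wt the optimum for the unperturbed model and Z_Δg the optimum after the deletion. A double deletion applies the same zero-flux constraint to the reactions disabled by removing both genes together.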

Essentiality Analysis

Information on gene essentiality from a transposon site hybridisation (TraSH) mutagenesis study for Mtb [23] has also been incorporated in the decision criteria.

Sequence Analysis

Close homologues for the Mtb proteins in the human proteome were identified by performing a BLAST search [24]. The BLAST results were parsed using python scripts based on BioPython http://www.biopython.org/. The criteria for regarding a protein as a close homologue were a sequence similarity of greater than 50% using a BLOSUM62 matrix, for a length of more than 50% of the bacterial query protein with an E-value less than 10^-4.
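
The three criteria above can be combined into a single predicate, sketched below. It assumes that the percentage similarity, the aligned length and the E-value have already been extracted from the BLAST output for each hit; the function name and argument names are hypothetical, and exactly how similarity and coverage were computed from the alignments is not specified in the text.

```python
def is_close_homologue(similarity_pct, aligned_length, query_length, e_value):
    """Apply the three stated cut-offs to one BLAST hit.

    similarity_pct : percentage similarity over the alignment (BLOSUM62)
    aligned_length : number of query residues covered by the alignment
    query_length   : full length of the bacterial query protein
    e_value        : E-value reported for the hit
    """
    return (
        similarity_pct > 50.0
        and aligned_length > 0.5 * query_length
        and e_value < 1e-4
    )

# Hypothetical hits for a 400-residue query.
strong_hit = is_close_homologue(62.0, 310, 400, 1e-30)   # passes all three
short_hit = is_close_homologue(80.0, 120, 400, 1e-12)    # fails coverage
weak_hit = is_close_homologue(55.0, 300, 400, 0.01)      # fails E-value
```

A protein would then be regarded as having a close human homologue if any of its hits satisfies the predicate.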

Structural Assessment of Targetability

Obtaining Structures

Crystal structures of 229 proteins from Mtb and 3,515 from human are available (excluding those with greater than 70% sequence identity) from the Protein Data Bank (PDB). This translates to a mere 6% of the Mtb proteome and under 10% of the human proteome. However, thousands of protein structures from both host and pathogen could be obtained using theoretically calculated structural models, from the ModBase database. Models in ModBase are built on the principles of homology modelling using Modeller [25]. Models of 2,808 proteins from Mtb and 16,000 proteins from the human proteome were obtained from this database. The database hosts multiple models for each protein, depending on the number of confident templates available for that protein in the PDB. For this analysis, only the first model for each protein was considered. Also, only those proteins which passed the previous stages of filtering in the target identification pipeline were considered. Of the 942 Mtb proteins considered, only 773 had available structures in ModBase.

Pocket Identification

In order to predict binding sites of a modelled protein, we have used PocketDepth (PD) [26], a geometry-based algorithm that has been developed and validated earlier by our group, to predict potential binding grooves on the surface of the protein. All possible binding sites in the 773 proteins of Mtb and the 16,000 human proteins were identified using PD. PD uses the concept of depth, which reflects how central a given pocket is and not merely how deep a subspace is in the pocket. PD outputs predicted binding sites in the form of sets or clusters. From such clusters, protein neighbourhoods within 4.0Å are extracted to obtain the binding sites.

An additional method to identify binding pockets in protein structures was used to obtain a consensus prediction. LigsiteCSC [27], a geometric method based on vectors in eight directions on a grid, also incorporating amino acid conservation information within each protein family, was used for this purpose. Top ten PD clusters were first obtained for each protein, which were compared with the top three pockets obtained from LigsiteCSC. Only the common clusters were retained for further analysis. 767 of the Mtb proteins and 15,830 of the human proteins were feasible for analysis, by which 3,500 pockets were identified in Mtb and 70,149 pockets in human.

Pocket Comparison

The next step towards structural assessment of targetability is to compare the binding sites of shortlisted targets of Mtb with those of the human proteome. An algorithm developed by us very recently, PocketMatch (PM) [28], has been used for this purpose. PM is based on shape signatures encoded by 90 lists of all-pair distances of residues in the binding site, pre-classified into one of the five standard amino acid types. A similarity score is assigned to each pair of binding sites. Extensive validation for PM, using the PDBbind database [29] of experimentally determined protein-ligand complexes is reported elsewhere [28]. We have now tested the algorithm to compare predicted pockets of all proteins in PDBbind as well. The SCOP-PM comparison for predicted pockets at various thresholds is provided as supplementary material [See Additional file 1].

All the 3,500 identified sites from the 767 short-listed proteins from Mtb were compared with the 70,149 identified sites from 15,830 human proteins. The topmost score for every protein pair is then chosen to capture the highest similarity an Mtb protein has in any of its pockets with any human protein. The scores are then compared to a pre-defined threshold as discussed in the results section to infer similarity. The exhaustive pairwise comparison of pockets is highly computationally intensive and was carried out on a massively parallel BlueGene (configuration: 4096 2-way shared memory processor nodes: 8192 IBM PowerPC 440x5 processors operating at 700 MHz, running Linux).
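
The reduction from pocket-level to protein-level scores can be sketched as follows. The record layout, the function names and the numerical values are hypothetical, and the similarity threshold is left as a parameter because its value is discussed only in the results section.

```python
def best_scores(pocket_scores):
    """Keep the topmost pocket similarity score for every
    (Mtb protein, human protein) pair.

    pocket_scores: iterable of
        (mtb_protein, mtb_pocket, human_protein, human_pocket, score)
    """
    best = {}
    for mtb_protein, _, human_protein, _, score in pocket_scores:
        key = (mtb_protein, human_protein)
        if score > best.get(key, float("-inf")):
            best[key] = score
    return best

def similar_to_human(best, threshold):
    """Mtb proteins with at least one human protein at or above threshold."""
    return {mtb for (mtb, _), score in best.items() if score >= threshold}

# Hypothetical pocket comparisons (identifiers and scores are made up).
records = [
    ("Rv1000", 1, "HUMAN_A", 1, 0.35),
    ("Rv1000", 2, "HUMAN_A", 3, 0.82),
    ("Rv1000", 1, "HUMAN_B", 2, 0.40),
    ("Rv2000", 1, "HUMAN_A", 1, 0.20),
]
top = best_scores(records)
flagged = similar_to_human(top, threshold=0.8)  # threshold is illustrative
```

Because each pocket pair is scored independently of all others, the comparisons can be split across processors and merged with the same maximum operation.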

Further Analysis of Short-listed Targets

The short-listed targets were subjected to further analysis, to retain only those proteins that are highly targetable.

Transcriptome Analysis/Gene Expression

One of the critical factors influencing the choice of a target would be its expression. Expression profiles related to persistence have been incorporated in [16]. Based on the expression of the genes, we have further filtered our list of targets. For this, we have used data from Small and co-workers [30], who have analysed the expression of genes in ten different strains of Mtb, M. tuberculosis H37Rv and M. tuberculosis H37Ra using cDNA microarrays. We also use data from Kaufmann and co-workers [31], who have performed a genome-wide expression analysis of Mtb from clinical lung samples using DNA arrays, and Barry and co-workers [32], who report an expression analysis of Mtb under a wide range of conditions. Lists of expressed genes have been reported in [30, 31], while in [32], the z-scores have been reported for gene expression, in each of the experiments. A gene passed this filter if it was reported to be expressed, by either of [30, 31], or in at least one of the studies (where an inhibitor of metabolism was not introduced) reported in [32].

Comparison with 'Anti-targets'

About seven proteins have been reported to form a set of 'anti-targets' [33], viz. the human ether-à-go-go-related gene (hERG), the pregnane X receptor (PXR), constitutive androstane receptor (CAR), P-glycoprotein (P-gp), as well as membrane receptors like the adrenergic α1a, the dopaminergic D2, the serotonergic 5-HT2c and the muscarinic M1. Unintentional binding of drugs to these proteins causes adverse effects, leading to their labelling as anti-targets. The sequences of 306 proteins in the human proteome corresponding to these anti-targets were fetched from the NCBI sequence database. The accession numbers of these protein sequences are provided as supplementary material [see Additional file 2]. The short-listed targets were compared to these anti-targets by standard sequence analysis.

Similarity to Gut Flora Proteins

A number of organisms are known to inhabit the gut of a normal healthy individual [34]. Inadvertent inhibition of proteins of these organisms is likely to result in side effects. In order to study this possibility, the short-listed Mtb proteins were compared to the proteins of the gut flora (296,017 proteins from 95 organisms), again by sequence analysis. Some of these organisms are Bacteroides intestinalis, Bifidobacterium bifidum, Bifidobacterium longum and Lactobacillus salivarius. A full list of the 95 organisms is provided as supplementary material [see Additional file 3].

Involvement in Persistence

Mtb has an unusual capacity to persist in the host at many levels. At the cellular level, it resides in macrophages that typically function to eliminate pathogens, and at the systemic level, it resists clearance by the adaptive immunity of the host. Its clearance by anti-bacterials is also very slow [35]. It may be possible to address the problem of persistence by targeting those genes that are implicated in persistence. For example, isocitrate lyase is a well-known persistence factor in mice, whose disruption attenuated bacterial persistence [36]. pcaA, a cyclopropane synthase involved in mycolic acid biosynthesis, has also been shown to be a requirement for long-term mycobacterial persistence and virulence in mouse models of tubercular infection [37]. Targets that passed all the previous filters were examined for expression during persistence based on several microarray expression datasets [32, 38–41].

Phylogenetic Profiling

Phylogenetic profiling was carried out against 707 fully sequenced bacterial genomes. First, a BLAST was run against each of the 707 genomes, for Mtb. The BLAST output was then parsed using python scripts, based on BioPython, to obtain the E-value of the best hit, with a match of more than 50% of the query length, for each sequence in Mtb. The E-values thus obtained were converted to scores between 0 and 1, with 0 representing a strong match and 1 representing a weak match. The score was calculated as -1/log(E). Hits with E > e^-4 were all neglected and given a score of 1.0. This is identical to the scoring scheme of Protein Link EXplorer (PLEX) [42], which however currently considers only 89 genomes. For each protein in Mtb, profile strings comprising scores for the hits of the proteins were generated. Each profile string thus encodes the presence or absence of each of the Mtb proteins and where present, the extent of similarity as well. A subset of these results, for 228 pathogenic genomes, was analysed to examine the broad-spectrum nature of an identified target.
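
The scoring scheme can be sketched as below. Two points are assumptions made for this illustration and are not stated in the text: the logarithm is taken to base 10 with a cut-off of 10^-4 (the text writes the cut-off as e^-4, so the intended base is ambiguous), and an E-value of exactly zero is mapped to a score of 0. The function names and the genome labels are hypothetical.

```python
import math

def profile_score(e_value, cutoff=1e-4):
    """Convert a best-hit E-value into a score in [0, 1];
    0 is a strong match and 1 a weak or absent one."""
    if e_value is None or e_value > cutoff:
        return 1.0  # no hit, or hit too weak to count
    if e_value == 0.0:
        return 0.0  # assumption: treat an underflowed E-value as a perfect match
    return -1.0 / math.log10(e_value)  # assumption: base-10 logarithm

def build_profile(best_e_values, genomes):
    """Profile for one Mtb protein: one score per genome, in a fixed order."""
    return [profile_score(best_e_values.get(genome)) for genome in genomes]

# Hypothetical best-hit E-values for one protein across three genomes.
genomes = ["genome_1", "genome_2", "genome_3"]
hits = {"genome_1": 1e-50, "genome_2": 0.5}  # no qualifying hit in genome_3
profile = build_profile(hits, genomes)
```

With these assumptions, any retained hit scores at most 0.25, so a score of 1.0 in a profile unambiguously marks the absence of a qualifying homologue in that genome.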

Involvement in Drug Resistance

Proteins involved in the emergence of resistance to anti-tubercular drugs have been analysed and reported by us recently [43]. A list of about 25 proteins closely connected to different pathways of resistance was obtained and used for the analysis here.

Results

A range of analyses spanning multiple levels of abstraction have been carried out, to identify plausible drug targets. The methodology can also be used more generally as a target identification pipeline that would be applicable to many drug discovery programmes. Starting from the entire proteome of Mtb H37Rv comprising 3,989 proteins, we have shortlisted 451 proteins as potential drug targets using a variety of filters, as depicted in Figs. 1 and 2. Fig. 1 illustrates a pictorial view of the targetTB pipeline while Fig. 2 shows a simplified view of the pipeline as a flowchart, illustrating the flow of this study. We first carry out a network analysis, where a full genome-scale interactome encoding several types of protein-protein interactions and protein-protein influences from metabolic pathways is reconstructed. Gene deletions that would significantly disrupt the network are then identified (List-A1). Next, we have studied the reactome through FBA (List-A2), to identify lethal gene deletions. This is further augmented with high-throughput gene essentiality data (List-A3). These system-level analyses together comprise Filter A. This is then integrated with sequence-level (Filter B) and structural analyses (Filter C) as described below (see Fig. 1). The expression of the gene encoding for the target is highly desirable (Filter E) and the list is further pruned by eliminating targets with high similarities to known 'anti-targets' in the human proteome (Filter F) and proteins in gut flora (Filter G). Those targets known to contribute to drug resistance in the pathogen are then prioritised. By analysis of similarity against several pathogenic proteomes, broad-spectrum targets as well as those unique to Mtb have also been identified. Various filters, lists and the numbers of proteins passed and eliminated at the various stages of the pipeline are given in Table 2.

Table 2 Models and methods used in the targetTB pipeline