Alignment of biological networks by integer linear programming: virus-host protein-protein interaction networks

Abstract

Background

The alignment of protein-protein interaction networks was recently formulated as an integer quadratic programming problem, along with a linearization that can be solved by integer linear programming software tools. However, the resulting integer linear program has a huge number of variables and constraints, rendering it of no practical use.

Results

We present a compact integer linear programming reformulation of the protein-protein interaction network alignment problem, which can be solved using state-of-the-art mathematical modeling and integer linear programming software tools, along with empirical results showing that small biological networks, such as virus-host protein-protein interaction networks, can be aligned in a reasonable amount of time on a personal computer and the resulting alignments are structurally coherent and biologically meaningful.

Conclusions

The implementation of the integer linear programming reformulation using current mathematical modeling and integer linear programming software tools provided biologically meaningful alignments of virus-host protein-protein interaction networks.

Background

Many meaningful questions in molecular biology have been successfully answered through their translation into alignment problems for different mathematical structures. From simple structures, such as genomic or proteomic sequences, to richer structures, such as complex networks or whole biological systems, pairwise and multiple alignment have been used to compare these structures, inferring features and new biological relations from their alignment.

Several methods and software tools have already been introduced for the alignment of biological networks, including protein-protein interaction networks, metabolic pathways, and gene regulatory networks. They address interesting biological questions, such as the inference of protein-protein interactions and protein functions, the regulation of biological processes, and the metabolic capabilities of microorganisms. The alignment and analysis of protein-protein interaction networks have become a key ingredient in obtaining functional orthologs and discovering protein-protein interactions and their associated functions, as well as evolutionarily conserved assembly pathways of protein complexes.

In the general network setting, and hence also in the particular case of protein-protein interaction networks, an alignment between two networks is an injective, but possibly partial, mapping from the set of nodes in one network (the source network) to the set of nodes in the other network (the target network). When the mapping defining the alignment has as domain the whole set of nodes of the source network, the alignment becomes an embedding of the source into the target network. Since biological networks are large networks, with hundreds to thousands of nodes and edges, most of the techniques developed for their alignment [1–5] are heuristic, and the alignments obtained by applying these techniques to the same biological networks often differ considerably and do not provide a true, consensus alignment. On the other hand, an exact solution to the network alignment problem can be obtained by an integer quadratic programming formulation [6], but its linearization [7] has a huge number of binary variables and constraints.

In this paper, we present a compact integer linear programming reformulation of the protein-protein interaction network alignment problem, which can be solved using state-of-the-art mathematical modeling and integer linear programming software tools. We also present empirical results showing that small biological networks, such as the virus-host protein-protein interaction networks in the STRING Viruses database [8], can be aligned in a reasonable amount of time on a personal computer and the resulting alignments are structurally coherent and biologically meaningful.

Results

The STRING Viruses database [8] contains sequences for 9,660,620 viral and host proteins and protein-protein interaction data for 230 viruses and 3 hosts: Homo sapiens (11,437,065 interactions), Saccharomyces cerevisiae (2,007,278 interactions), and Escherichia coli (1,166,900 interactions). We downloaded from STRING Viruses the virus-host protein-protein interaction data for Homo sapiens and all the protein sequence data (see Availability of data and materials).

Each of the protein-protein interactions is annotated with a combined score, an indicator of confidence ranging from 0 to 1, where a combined score of 0.5 indicates that roughly every second interaction might be a false positive. Therefore, we discarded any protein-protein interaction with a combined score under 0.510, keeping only interactions in the last 10% of the distribution of combined scores. Also, host-host protein-protein interactions were discarded, since the purpose of the alignment in this experiment is to study the relation between the proteins of a virus and its host. Nevertheless, in a more general setting host-host protein-protein interactions can also be considered. Further, we discarded the smallest networks, those with 64 or fewer interactions, and focused our alignment experiments on the remaining 25 largest networks, which have between 56 and 735 viral and host proteins and between 65 and 957 virus-host protein-protein interactions. These networks are listed in Table S1, ranked by the number of interactions, and the viral proteins involved in them are listed in Tables S2–S26 [see Additional file 1].
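
As an illustration of the filtering step described above, the following Python sketch keeps only high-confidence virus-host interactions. It assumes that interactions are given as (protein, protein, score) triples with scores already scaled to [0, 1], and that host proteins can be recognised by a taxon prefix such as `9606.`. The function names and the example identifiers are hypothetical and are not taken from the pipeline used in the study. Keeping only pairs with exactly one host protein also drops virus-virus pairs, which is a further assumption of this sketch.

```python
from typing import Iterable, List, Tuple

# (protein_a, protein_b, combined score in [0, 1]); an assumed input layout.
Interaction = Tuple[str, str, float]


def is_host(protein_id: str, host_taxid: str = "9606") -> bool:
    """Assumed convention: identifiers look like '<taxid>.<protein name>'."""
    return protein_id.split(".", 1)[0] == host_taxid


def filter_virus_host(
    interactions: Iterable[Interaction],
    threshold: float = 0.510,
    host_taxid: str = "9606",
) -> List[Interaction]:
    """Keep interactions scoring at least `threshold` that join one viral
    protein and one host protein."""
    kept = []
    for a, b, score in interactions:
        if score < threshold:
            continue  # low-confidence interaction
        if is_host(a, host_taxid) == is_host(b, host_taxid):
            continue  # host-host pair (or virus-virus pair, by assumption)
        kept.append((a, b, score))
    return kept


# Made-up identifiers, for illustration only.
example = [
    ("11269.P1", "9606.H1", 0.93),
    ("11269.P2", "9606.H2", 0.40),
    ("9606.H1", "9606.H2", 0.99),
]
filtered = filter_virus_host(example)
```

Only the first triple survives: the second falls below the threshold and the third joins two host proteins.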

Then, we aligned all possible pairs of these 25 networks. Due to the symmetry of the network alignment problem, we actually aligned 25·24/2=300 pairs of networks. We performed each of these 300 alignments using the compact integer linear programming formulation presented in this paper with AMPL version 2018.10.22 [9] and Gurobi Optimizer version 8.1.0, and also with some of the most popular protein-protein interaction network alignment tools: PINALOG [1], SPINAL [2], HubAlign [3], L-GRAAL [4], and AligNet [5], using default parameters for all of them. All of the alignments were computed using a personal computer with an Intel Core i7-8550U quad-core processor at 1.80 GHz and 32 GB of memory running Ubuntu 18.04 LTS. We took either the optimal solution or the best feasible solution that could be computed within a solver time limit of 60 minutes.
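
The count of 300 pairs can be checked by enumerating unordered pairs directly. The network names in this short Python sketch are placeholders, not the names used in Table S1.

```python
from itertools import combinations

# Placeholder labels for the 25 networks.
networks = [f"network_{i:02d}" for i in range(1, 26)]

# Unordered pairs: each pair of distinct networks is listed once.
pairs = list(combinations(networks, 2))
```

Here `len(pairs)` agrees with 25·24/2.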

While our method is aimed at finding exact solutions to the problem of aligning protein-protein interaction networks, all of the aforementioned protein-protein interaction network alignment tools use a heuristic algorithm to obtain the final alignment. The general idea behind all of these alignment tools is to define a node similarity measure that combines the similarity of the protein sequences with some network structure similarity. Then, the actual alignment is obtained based on node similarity. More precisely, in PINALOG, network structures are “communities,” which are scored and aligned based on a node similarity score that combines protein sequence similarity and GO terms, and aligned communities are extended to obtain the network alignment. In SPINAL, the node similarity score is defined based on the sequence similarity of nodes and of their neighbours, this score is iterated until some stability is reached, and the network alignment is obtained by a greedy, seed-and-extend approach. In HubAlign, network structures are “hubs” and “bottlenecks,” a score or weight is assigned to each node and edge of a network using an iterative minimum-degree heuristic algorithm to measure the topological and functional importance of a node (that is, the likelihood of being a hub or bottleneck), and the network alignment is obtained by choosing protein pairs with high alignment score by, again, a greedy, seed-and-extend approach. In L-GRAAL, node similarity is measured by considering 2-node to 4-node graphlet (connected subgraph) degree similarity and, based on node similarity, seeds are obtained using Integer Linear Programming (ILP) and Lagrangian relaxation and then extended to a network alignment using a greedy heuristic algorithm. Last, but not least, in AligNet, an overlapping clustering for every node in every network is computed. Then, all cluster pairs are aligned and scored based on the sequence similarity of proteins and their neighbours. Finally, the clusters in one network are aligned with the clusters in the other network using the Hungarian algorithm, and local network alignments are first obtained as solutions to weighted bipartite hypergraph problem instances and then extended to a global network alignment.

In order to evaluate the alignments we considered the work reported in [10, 11] where several topological coherence and biological coherence measures were proposed for the comparison of protein-protein interaction network alignment methods and tools. It is shown in [11] that there is a strong correlation among the various topological coherence measures and also among the various biological coherence measures, while there is a weak correlation between the topological coherence measures and the biological coherence measures. Therefore, we have chosen one topological coherence measure and one biological coherence measure for assessing the quality of virus-host protein-protein interaction network alignments: EC, the edge correctness score, defined as the ratio of the interactions that are preserved by the alignment over the total number of interactions [10], and the sequence similarity score, a measure of functional coherence (FC), defined as the normalized sum of the sequence similarities (correlation of amino acid composition [12]) of the aligned proteins.
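
The edge correctness score defined above can be sketched in a few lines of Python. This is a minimal illustration, assuming that each undirected interaction of the source network is listed once, that there are no self-interactions, and that the denominator is the number of interactions in the source network; the function name and the toy networks are hypothetical.

```python
def edge_correctness(source_edges, target_edges, alignment):
    """Fraction of source edges whose endpoints are both aligned and whose
    images are joined by an edge in the target network."""
    target = {frozenset(edge) for edge in target_edges}
    preserved = 0
    total = 0
    for u, v in source_edges:
        total += 1
        if u in alignment and v in alignment:
            if frozenset((alignment[u], alignment[v])) in target:
                preserved += 1
    return preserved / total if total else 0.0


# Toy example: a path a-b-c-d aligned to a path x-y-z, with d left unaligned.
source_edges = [("a", "b"), ("b", "c"), ("c", "d")]
target_edges = [("x", "y"), ("y", "z")]
alignment = {"a": "x", "b": "y", "c": "z"}
ec = edge_correctness(source_edges, target_edges, alignment)
```

In the toy example two of the three source edges are preserved, since the edge c-d has an unaligned endpoint.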

In Fig. 1 we show the boxplot of the edge correctness scores obtained for every alignment with the six alignment methods and tools considered in this study. We can observe there that L-GRAAL and our ILP method obtained the best results, with mean EC scores of 0.83 and 0.78, respectively. As far as biological coherence goes, in Fig. 2 we observe that PINALOG, being the alignment tool with the lowest EC scores, is the tool that reached the highest FC scores, with a mean FC score of 0.92, followed by our ILP method, with a mean FC score of 0.90. We can also observe that, as stated in [10, 11], some alignment methods and tools obtain either high EC scores but low FC scores, or low EC scores but high FC scores. As a measure of a balance between topological and biological coherence, we took the mean of the EC and FC scores, whose boxplot we show in Fig. 3. We can observe that, again, L-GRAAL and our ILP method obtained the best scores, followed by HubAlign and AligNet.

Fig. 1

EC Scores (λ=0). Boxplot of EC scores for the 300 alignments of 25 virus-host protein-protein interaction networks from the STRING Viruses database, for λ=0. L-GRAAL and ILP obtained the highest scores

Fig. 2

FC Scores (λ=1). Boxplot of FC scores for the 300 alignments of 25 virus-host protein-protein interaction networks from the STRING Viruses database, for λ=1. PINALOG, followed by ILP, obtained the highest scores

Fig. 3

Combined EC and FC scores (λ=0.5). Boxplot of the mean of EC and FC scores for the 300 alignments of 25 virus-host protein-protein interaction networks from the STRING Viruses database, for λ=0.5. ILP and L-GRAAL obtained the highest scores

Table 1 shows the mean edge correctness and sequence similarity scores for all six alignment approaches considered in this text, for the 300 pairs of virus-host protein-protein interaction networks from the STRING Viruses database described above. Moreover, Table 2 illustrates the trade-off between the conservation of interactions and the alignment of similar proteins, for a subset of 45 pairs of virus-host protein-protein interaction networks, as a function of a parameter λ∈[0,1] that controls the balance between protein similarity scores and protein-protein interaction weights in our model (see the “Methods” section for more details). With λ=0, we obtain an alignment with the highest topological coherence but with the lowest biological coherence, while λ=1 produces an alignment with the lowest topological coherence but with the highest biological coherence.

Table 1 Edge correctness score and sequence similarity score (mean values) for several protein-protein interaction network alignment methods and tools, for 300 pairs of virus-host protein-protein interaction networks from the STRING Viruses database, for λ=0.5. Sequence similarity scores are normalized global alignment scores
Table 2 Edge correctness score and sequence similarity score (mean values) for 45 pairs of virus-host protein-protein interaction networks from the STRING Viruses database, for the integer linear programming formulation and different values of the λ parameter. The maximum sum of the edge correctness and sequence similarity scores is achieved at λ=0.4, followed by λ=0.5. Sequence similarity scores are normalized global alignment scores

In order to measure the amount of variation or dispersion of the EC and FC scores used to evaluate the topological and biological coherence of the alignments, we introduced some noise to the virus-host protein-protein interaction networks by randomly adding and deleting 5% of the interactions. We computed 10,000 alignments between 100 random perturbations of the Marburg marburgvirus (taxid 11269) and 100 random perturbations of the Zaire ebolavirus (taxid 186538) virus-host protein-protein interaction networks. The mean and standard deviation of the EC and FC scores are 0.955413 and 0.012193 for the EC score and 0.991356 and 0.003269 for the FC score. That is, small perturbations of the virus-host protein-protein interaction networks produced small variations of the EC and FC scores.
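
One way to generate such perturbed networks is sketched below in Python. It assumes that "adding and deleting 5% of the interactions" means removing a random 5% of the existing edges and adding the same number of edges chosen uniformly among all absent node pairs; the study may have restricted added edges to virus-host pairs, which this sketch does not do. The function name and the ring-shaped test network are hypothetical.

```python
import random
from itertools import combinations


def perturb_edges(nodes, edges, fraction=0.05, seed=None):
    """Return a new edge set with `fraction` of the edges removed and the
    same number of previously absent edges added."""
    rng = random.Random(seed)
    edge_set = {frozenset(edge) for edge in edges}
    k = round(fraction * len(edge_set))
    # Sort before sampling so that a fixed seed gives a reproducible result.
    existing = sorted(edge_set, key=sorted)
    non_edges = [
        frozenset(pair)
        for pair in combinations(sorted(nodes), 2)
        if frozenset(pair) not in edge_set
    ]
    removed = set(rng.sample(existing, k))
    added = set(rng.sample(non_edges, min(k, len(non_edges))))
    return (edge_set - removed) | added


# Toy network: a ring with 40 nodes and 40 edges.
ring_nodes = list(range(40))
ring_edges = [(i, (i + 1) % 40) for i in ring_nodes]
noisy = perturb_edges(ring_nodes, ring_edges, fraction=0.05, seed=1)
```

Repeating the call with different seeds yields independent perturbations of the same network, which can then be aligned pairwise as in the experiment above.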

While these results are based on a particular view of sequence similarity as correlation of amino acid composition, as mentioned above, it is possible to use the protein-protein interaction network alignment method with any measure of sequence similarity, including alignment-free measures such as the Euclidean distance between k-mer frequencies [12] and also alignment-based measures such as a normalized global alignment score. Table 3 shows the mean edge correctness and sequence similarity scores for different measures of sequence similarity (Euclidean distance between k-mer frequencies, for k between 1 and 4, and normalized global alignment score), for a subset of 45 pairs of virus-host protein-protein interaction networks with λ=0.5. The higher the value of k, the lower the mean sequence similarity score, with normalized global alignment giving the lowest score, but the mean edge correctness score is unaffected by the choice of sequence similarity measure.
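
For concreteness, the Euclidean distance between k-mer frequency vectors can be sketched in Python as follows. The function names are hypothetical, and the sketch stops at the distance itself: how a distance is turned into the similarity score used by the alignment model is not specified here and is left open.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(sequence, k):
    """Relative frequency of every k-mer occurring in `sequence`."""
    n = len(sequence) - k + 1
    if n <= 0:
        return {}
    counts = Counter(sequence[i:i + k] for i in range(n))
    return {kmer: count / n for kmer, count in counts.items()}


def kmer_distance(seq_a, seq_b, k):
    """Euclidean distance between the k-mer frequency vectors of two
    sequences; k-mers absent from a sequence have frequency 0."""
    fa = kmer_frequencies(seq_a, k)
    fb = kmer_frequencies(seq_b, k)
    return sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2
                    for w in set(fa) | set(fb)))


same = kmer_distance("MKVLAA", "MKVLAA", 2)
disjoint = kmer_distance("AAAA", "CCCC", 1)
```

Identical sequences are at distance 0, and two sequences with no k-mer in common and a single k-mer each are at distance √2.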

Table 3 Edge correctness score and sequence similarity score (mean values) for 45 pairs of virus-host protein-protein interaction networks from the STRING Viruses database, for the integer linear programming formulation with λ=0.5 and different sequence similarity measures [12]

Discussion

To reinforce the statement that the integer linear programming formulation of the network alignment problem provides biologically meaningful alignments of virus-host protein-protein interaction networks, we analyzed the alignments in terms of agreement with virus taxonomy. Namely, we considered the taxonomy classification of the virus in every virus-host protein-protein interaction network and assumed that the highest alignment scores must be obtained when considering closely related viruses. Indeed, Table 4 shows that the best alignment (measured by the mean value of edge correctness and sequence similarity) for each of the 25 virus-host protein-protein interaction networks in Table 5 corresponds to a network in the same Baltimore class [13] for 21 of the 25 best alignments. Table 5 also shows the taxonomy classification of the 25 viruses considered in our study.

Table 4 Best alignment for the virus-host protein-protein interaction networks for human viruses in the STRING Viruses database considered in our study. Twenty-one of the 25 networks are aligned with networks corresponding to viruses of the same Baltimore class. Sequence similarity scores are normalized global alignment scores
Table 5 The virus-host protein-protein interaction networks for human viruses in the STRING Viruses database considered in our study

As a matter of fact, in class I (double-stranded DNA viruses), the Alphapapillomavirus 9 network is best aligned with the Human alphaherpesvirus 2 network; the Human betaherpesvirus 5 network is best aligned with the Human betaherpesvirus 6B network; the Human alphaherpesvirus 3 network is best aligned with the Human alphaherpesvirus 1 network; the Human alphaherpesvirus 1 and Human alphaherpesvirus 2 networks are best aligned with each other; and the Human betaherpesvirus 6A and Human betaherpesvirus 6B networks are also best aligned with each other.

In class IV (positive-sense single-stranded RNA viruses), the Human coronavirus 229E network is best aligned with the SARS-related coronavirus network. In class V (negative-sense single-stranded RNA viruses), the Influenza A virus network is best aligned with the Marburg marburgvirus network; the Human orthopneumovirus, Mumps rubulavirus, and Hendra henipavirus networks are best aligned with the Human metapneumovirus network; the Marburg marburgvirus and Zaire ebolavirus networks are best aligned with each other; and the Human metapneumovirus and Measles morbillivirus networks are also best aligned with each other.

Finally, in class VI (positive-sense single-stranded RNA viruses that replicate through a DNA intermediate), the Human immunodeficiency virus 2 network is best aligned with the Primate T-lymphotropic virus 2 network; the Primate T-lymphotropic virus 1 network is best aligned with the Primate T-lymphotropic virus 3 network; and the Primate T-lymphotropic virus 2 and Primate T-lymphotropic virus 3 networks are also best aligned with each other.

Conclusions

The compact integer linear programming reformulation of the protein-protein interaction network alignment problem can also be applied to similar alignment problems on graph-based representations of molecular structures, such as metabolic pathways and gene regulatory networks. The application to virus-host protein-protein interaction networks provided alignments that score highly in both topological and biological coherence, which constitutes evidence that the alignments obtained with this approach are biologically meaningful.

The alignment of virus-host protein-protein interaction networks may help to discover the effects of viral infection on the host. New databases with virus information have been created in recent years from the analysis of new metagenomics data [14–16]. However, one of the current problems is to understand the mechanism by which viruses infect a host and to determine the viral proteins interacting with host proteins that are responsible for such an infection. New sets of Gene Ontology classes have been developed that are applicable to microbes and their hosts, improving both coverage and quality in this area of the Gene Ontology [17]. Therefore, the alignment of virus-host protein-protein interaction networks may prove to be a useful tool for predicting new functions of viral proteins related to host infection, as network alignment has already proven useful for inferring new protein functions.

Methods

The following notation will be used in this section. A protein-protein interaction network is represented by means of an undirected graph \(G=(V,E)\), where each node \(v \in V\) corresponds to a protein and each edge \(\{u,v\} \in E\) corresponds to an interaction between the proteins represented by the nodes \(u \in V\) and \(v \in V\). Let \(G=(V,E)\) and \(G'=(V',E')\) be the two protein-protein interaction networks to be aligned, let \(V=\{v_{1},\ldots,v_{m}\}\) and \(V'=\{v^{\prime }_{1},\ldots,v^{\prime }_{n}\}\) be their respective sets of nodes and \(A=(a_{ij})\) and \(B=(b_{k\ell })\) be their respective adjacency matrices. Let \(S=(s_{ik})\) be a similarity matrix between the nodes of the two networks, with each \(s_{ik}\) the similarity score of \(v_{i} \in V\) and \(v^{\prime }_{k} \in V'\).

An alignment of \(G\) and \(G'\) can be represented by a binary matrix \(X=(x_{ik})\), where \(x_{ik}=1\) if the i-th node, \(v_{i}\), of the first network is aligned with the k-th node, \(v^{\prime }_{k}\), of the second network, and \(x_{ik}=0\) otherwise. Then, the protein-protein interaction network alignment problem has the following simple integer quadratic programming (IQP) formulation in terms of the binary variables \(x_{ik}\) [6].

Problem IQP. Objective:

$$\begin{aligned} &\max \lambda \sum\limits_{i=1}^{m} \sum\limits_{k=1}^{n} s_{ik}\,x_{ik} \\ &+ (1 - \lambda) \sum\limits_{i=1}^{m} \sum\limits_{k=1}^{n} \sum\limits_{j=1}^{m} \sum\limits_{\ell=1}^{n} a_{ij}\,b_{k\ell}\,x_{ik}\,x_{j\ell} \end{aligned} $$

subject to the constraints

  • (Q1) \(x_{ik} \in \{0,1\},\quad i=1,\ldots,m,\quad k=1,\ldots,n\)

  • (Q2) \(\sum \limits _{k=1}^{n} x_{ik} \leqslant 1,\quad i=1,\ldots,m\)

  • (Q3) \(\sum \limits _{i=1}^{m} x_{ik} \leqslant 1,\quad k=1,\ldots,n\)

In this problem’s objective function, λ is a parameter, with 0≤λ≤1, that controls the balance between protein similarity scores and protein-protein interaction weights: only node scores are considered when λ=1, and only edge scores are taken into account when λ=0. Constraints (Q2) and (Q3) enforce that, for every i=1,…,m, at most one \(x_{ik}\) is equal to 1 (that is, that the matrix \(X=(x_{ik})\) defines a, possibly partial, mapping) and that, for every k=1,…,n, at most one \(x_{ik}\) is equal to 1 (that is, that the mapping defined by \(X\) is injective) and hence that the matrix \(X\) defines an alignment between the networks \(G\) and \(G'\), given by \(\{(v_{i},v^{\prime }_{k}) \in V \times V' : x_{ik}=1\}\).

The objective function above comes from the PathBLAST [18] idea that protein-protein network alignment be based on a log-probability-like criterion, with matching terms corresponding to both proteins and interactions [6]. The first sum in the objective function,

$$\sum\limits_{i=1}^{m} \sum\limits_{k=1}^{n} s_{ik}\,x_{ik}, $$

represents the global similarity of the pairs of matching proteins, while the second sum,

$$\sum\limits_{i=1}^{m} \sum\limits_{k=1}^{n} \sum\limits_{j=1}^{m} \sum\limits_{\ell=1}^{n} a_{ij}\,b_{k\ell}\,x_{ik}\,x_{j\ell}, $$

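represents the number of edges that are preserved by the alignment; that is, of pairs of edges \((v_{i},v_{j}) \in E\) and \((v^{\prime }_{k},v^{\prime }_{\ell }) \in E'\) such that \(v_{i}\) is aligned with \(v^{\prime }_{k}\) and \(v_{j}\) is aligned with \(v^{\prime }_{\ell }\).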
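
To make the two sums concrete, the following Python sketch evaluates the IQP objective for a given binary matrix. It is a direct transcription of the formula with plain nested lists; the function names and the toy data are hypothetical. One detail worth keeping in mind is that, with symmetric adjacency matrices, the quadruple sum visits each preserved undirected edge once per orientation, so it equals twice the number of preserved edges.

```python
def similarity_term(S, X):
    """First sum: total similarity of the aligned protein pairs."""
    return sum(S[i][k] * X[i][k]
               for i in range(len(X)) for k in range(len(X[0])))


def conserved_term(A, B, X):
    """Second sum: sum of a_ij * b_kl * x_ik * x_jl over all i, k, j, l."""
    m, n = len(A), len(B)
    return sum(A[i][j] * B[k][l] * X[i][k] * X[j][l]
               for i in range(m) for k in range(n)
               for j in range(m) for l in range(n))


def iqp_objective(lam, S, A, B, X):
    """Weighted combination of the two sums, with 0 <= lam <= 1."""
    return lam * similarity_term(S, X) + (1 - lam) * conserved_term(A, B, X)


# Toy data: two copies of the path v1 - v2 - v3, aligned by the identity.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
B = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
S = [[0.5] * 3 for _ in range(3)]          # made-up similarity scores
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
value = iqp_objective(0.5, S, A, B, X)
```

In this toy case both edges of the path are preserved, so `conserved_term` returns 4 (two edges, two orientations each).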

This quadratic formulation has a linearization with O(m²n²) binary variables and constraints [7], of no practical use with current integer linear programming software tools such as IBM ILOG CPLEX Optimization Studio or Gurobi Optimizer. We present next a much more compact linearization, with only O(mn) binary variables, integer variables, and constraints, along the lines of a well-known linearization of the quadratic assignment problem [19–21].
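
The gap between the two orders of growth is easy to quantify. The Python sketch below counts the variables and constraints of the compact formulation as stated in Problem ILP (mn binary variables, mn integer variables, and m + n + 2mn constraints from (L2) to (L5), treating (L1) and the lower bound in (L4) as variable domains), and compares them with m²n², used here only as the order of magnitude of the earlier linearization and not as its exact size. The function names are hypothetical, and the sizes 56 and 735 are the smallest and largest numbers of proteins reported in the Results section.

```python
def compact_ilp_size(m, n):
    """Variable and constraint counts of the compact formulation."""
    return {
        "binary_variables": m * n,            # x_ik
        "integer_variables": m * n,           # y_ik
        "constraints": m + n + 2 * m * n,     # (L2), (L3), (L4), (L5)
    }


def quadratic_order(m, n):
    """Order of magnitude m^2 n^2 of the earlier linearization."""
    return (m * n) ** 2


small = compact_ilp_size(56, 56)
large = compact_ilp_size(735, 735)
ratio = quadratic_order(735, 735) // large["binary_variables"]
```

For two networks with 735 proteins each, the compact model has about half a million binary variables, while m²n² is larger by a factor of mn.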

In addition to the binary variables \(x_{ik}\) above, we introduce an integer variable \(y_{ik}\) for each \(v_{i} \in V\) and each \(v^{\prime }_{k} \in V'\). Each such new variable \(y_{ik}\) is intended to represent

$$y_{ik} = x_{ik}\sum\limits_{j=1}^{m} \sum\limits_{\ell=1}^{n} a_{ij} b_{k\ell} x_{j\ell} $$

for i=1,…,m and k=1,…,n. In this way, if \(x_{ik}=0\) then \(y_{ik}=0\), and if \(x_{ik}=1\) then \(y_{ik}\) is the number of edges incident to \(v_{i}\) in \(G\) that are preserved by the alignment.

Since

$$\begin{aligned} \sum\limits_{i=1}^{m} \sum\limits_{k=1}^{n} y_{ik} = \sum\limits_{i=1}^{m} \sum\limits_{k=1}^{n} x_{ik} \sum\limits_{j=1}^{m} \sum\limits_{\ell=1}^{n} a_{ij} b_{k\ell} x_{j\ell} \\ = \sum\limits_{i=1}^{m} \sum\limits_{k=1}^{n} \sum\limits_{j=1}^{m} \sum\limits_{\ell=1}^{n} a_{ij} b_{k\ell} x_{ik} x_{j\ell}, \end{aligned} $$

using these new variables, the objective function of Problem IQP can be rewritten as a linear function:

$$\lambda \sum\limits_{i=1}^{m} \sum\limits_{k=1}^{n} s_{ik}\,x_{ik} + (1 - \lambda) \sum\limits_{i=1}^{m} \sum\limits_{k=1}^{n} y_{ik} $$

This motivates the following linear reformulation of problem IQP:

Problem ILP. Objective:

$$\max \lambda \sum\limits_{i=1}^{m} \sum\limits_{k=1}^{n} s_{ik}\,x_{ik} + (1 - \lambda) \sum\limits_{i=1}^{m} \sum\limits_{k=1}^{n} y_{ik} $$

subject to the constraints

  1. (L1) \(x_{ik} \in \{0,1\},\quad i=1,\ldots,m,\quad k=1,\ldots,n\)

  2. (L2) \(\sum \limits _{k=1}^{n} x_{ik} \leqslant 1,\quad i=1,\ldots,m\)

  3. (L3) \(\sum \limits _{i=1}^{m} x_{ik} \leqslant 1,\quad k=1,\ldots,n\)

  4. (L4) \(0 \leqslant y_{ik} \leqslant x_{ik}\sum \limits _{j=1}^{m} \sum \limits _{\ell =1}^{n} a_{ij} b_{k\ell },\quad i=1,\ldots,m,\quad k=1,\ldots,n\)

  5. (L5) \(y_{ik} \leqslant \sum \limits _{j=1}^{m} \sum \limits _{\ell =1}^{n} a_{ij} b_{k\ell } x_{j\ell },\quad i=1,\ldots,m,\quad k=1,\ldots,n\)

This linear problem turns out to be equivalent to problem IQP, because of the following lemma:

Lemma 1

A binary matrix \((x_{ik})\) is a solution to Problem IQP if, and only if, there is an integer matrix \((y_{ik})\) such that \(((x_{ik}),(y_{ik}))\) is a solution to Problem ILP. Moreover, when λ<1, if \((x_{ik})\) is a solution to Problem IQP and \((y_{ik})\) is such that \(((x_{ik}),(y_{ik}))\) is a solution to Problem ILP, then

$$y_{ik} = x_{ik} \sum\limits_{j=1}^{m} \sum\limits_{\ell=1}^{n} a_{ij} b_{k\ell} x_{j\ell} $$

for every i=1,…,m and k=1,…,n.

Proof

If λ=1, the second sum in the objective function of both problems vanishes and therefore \((x_{ik})\) is a solution to Problem IQP if, and only if, \(((x_{ik}),(y_{ik}))\) is a solution to Problem ILP for every integer matrix \((y_{ik})\) satisfying constraints (L4) and (L5).

Now, assume that λ<1. It is clear from the problems’ objective functions that if \((x_{ik})\) is a solution to Problem IQP, then taking for every i=1,…,m and k=1,…,n,

$$y_{ik} = x_{ik} \sum\limits_{j=1}^{m} \sum\limits_{\ell=1}^{n} a_{ij} b_{k\ell} x_{j\ell} $$

we obtain a solution \(((x_{ik}),(y_{ik}))\) to Problem ILP.

Conversely, assume that \(((x_{ik}),(y_{ik}))\) is a solution to Problem ILP. If \(x_{i_{0}k_{0}}=0\), constraint (L4) implies that

$$y_{i_{0}k_{0}}=0=x_{i_{0}k_{0}} \sum\limits_{j=1}^{m} \sum\limits_{\ell=1}^{n} a_{i_{0}j} b_{k_{0}\ell} x_{j\ell}. $$

And if \(x_{i_{0}k_{0}}=1\), by constraint (L5) we have that

$$y_{i_{0}k_{0}} \leqslant \sum\limits_{j=1}^{m} \sum\limits_{\ell=1}^{n} a_{i_{0}j} b_{k_{0}\ell} x_{j\ell}\leqslant x_{i_{0}k_{0}}\sum\limits_{j=1}^{m} \sum\limits_{\ell=1}^{n} a_{i_{0}j} b_{k_{0}\ell} $$

and this turns out to imply that, actually,

$$y_{i_{0}k_{0}}=\sum\limits_{j=1}^{m} \sum\limits_{\ell=1}^{n} a_{i_{0}j} b_{k_{0}\ell} x_{j\ell}=x_{i_{0}k_{0}}\sum\limits_{j=1}^{m} \sum\limits_{\ell=1}^{n} a_{i_{0}j} b_{k_{0}\ell} x_{j\ell}. $$

Indeed, if \(x_{i_{0}k_{0}}=1\) and \(y_{i_{0}k_{0}}<{\sum \nolimits }_{j=1}^{m} {\sum \nolimits }_{\ell =1}^{n} a_{i_{0}j} b_{k_{0}\ell } x_{j\ell }\), then the pair of matrices \(\left (\left (x_{ik}\right),\left (\hat {y}_{ik}\right)\right)\) with \(\hat {y}_{ik}=y_{ik}\) except for \(\hat {y}_{i_{0}k_{0}}={\sum \nolimits }_{j=1}^{m} {\sum \nolimits }_{\ell =1}^{n} a_{i_{0}j} b_{k_{0}\ell } x_{j\ell }\) still satisfies constraints (L1) to (L5) and, since λ<1, it has a larger value of the objective function in Problem ILP, which would contradict the assumption that \(((x_{ik}),(y_{ik}))\) is a solution to Problem ILP.

This implies that, when λ<1, if \(((x_{ik}),(y_{ik}))\) is a solution to Problem ILP, then

$$y_{ik}=x_{ik} \sum\limits_{j=1}^{m} \sum\limits_{\ell=1}^{n} a_{ij} b_{k\ell} x_{j\ell} $$

for every i=1,…,m and k=1,…,n. Since the constraints on \((x_{ik})\) are the same in both problems, we conclude that \((x_{ik})\) is a solution to Problem IQP. □

Therefore, a solution \(((x_{ik}),(y_{ik}))\) of the linear reformulation ILP of the alignment problem defines an alignment between the mapped proteins in the two networks via \(\{(v_{i},v^{\prime }_{k}) \in V \times V' : x_{ik}=1\}\).
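
The equivalence stated in Lemma 1 can be checked numerically on very small instances. The Python sketch below does so by brute force rather than with an integer linear programming solver: it enumerates every partial injective mapping, sets each \(y_{ik}\) to the largest value allowed by (L4) and (L5) (which is what an optimal ILP solution does when λ<1), and compares the best objective values of the two formulations. All function names and the toy instance are hypothetical, and the enumeration is only feasible for a handful of nodes.

```python
from itertools import combinations, permutations


def partial_injections(m, n):
    """Yield every partial injective mapping {i: k} from range(m) to range(n)."""
    for size in range(min(m, n) + 1):
        for rows in combinations(range(m), size):
            for cols in permutations(range(n), size):
                yield dict(zip(rows, cols))


def as_matrix(mapping, m, n):
    X = [[0] * n for _ in range(m)]
    for i, k in mapping.items():
        X[i][k] = 1
    return X


def quadratic_term(A, B, X):
    m, n = len(A), len(B)
    return sum(A[i][j] * B[k][l] * X[i][k] * X[j][l]
               for i in range(m) for k in range(n)
               for j in range(m) for l in range(n))


def largest_feasible_y(A, B, X):
    """For a fixed X, the largest y_ik allowed by (L4) and (L5)."""
    m, n = len(A), len(B)
    Y = [[0] * n for _ in range(m)]
    for i in range(m):
        for k in range(n):
            bound_l4 = X[i][k] * sum(A[i][j] * B[k][l]
                                     for j in range(m) for l in range(n))
            bound_l5 = sum(A[i][j] * B[k][l] * X[j][l]
                           for j in range(m) for l in range(n))
            Y[i][k] = min(bound_l4, bound_l5)
    return Y


def compare_formulations(lam, S, A, B):
    """Best IQP and ILP objective values over all alignments (lam < 1)."""
    m, n = len(A), len(B)
    best_iqp = best_ilp = float("-inf")
    for mapping in partial_injections(m, n):
        X = as_matrix(mapping, m, n)
        node = sum(S[i][k] * X[i][k] for i in range(m) for k in range(n))
        Y = largest_feasible_y(A, B, X)
        best_iqp = max(best_iqp, lam * node + (1 - lam) * quadratic_term(A, B, X))
        best_ilp = max(best_ilp, lam * node + (1 - lam) * sum(map(sum, Y)))
    return best_iqp, best_ilp


A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # path v1 - v2 - v3
B = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]   # triangle
S = [[0.9, 0.1, 0.2], [0.3, 0.8, 0.1], [0.2, 0.4, 0.7]]  # made-up scores
best_iqp, best_ilp = compare_formulations(0.5, S, A, B)
```

On this toy instance the two optima coincide, and for every enumerated alignment the sum of the entries of `largest_feasible_y` equals the quadratic term, as the lemma predicts.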

Availability of data and materials

The datasets analysed during the current study are available in the STRING Viruses repository, http://viruses.string-db.org/download/protein.links.v10.5/9606.protein.links.v10.5.txt.gz and http://viruses.string-db.org/download/protein.sequences.v10.5.fa.gz.

Abbreviations

EC: Edge correctness

FC: Functional coherence

ILP: Integer linear programming

IQP: Integer quadratic programming

References

  1. Phan HTT, Sternberg MJE. PINALOG: A novel approach to align protein interaction networks—implications for complex detection and function prediction. Bioinformatics. 2012; 28(9):1239–45.

  2. Aladaǧ AE, Erten C. SPINAL: Scalable protein interaction network alignment. Bioinformatics. 2013; 29(7):917–24.

  3. Hashemifar S, Xu J. HubAlign: An accurate and efficient method for global alignment of protein-protein interaction networks. Bioinformatics. 2014; 30(17):438–44.

  4. Malod-Dognin N, Pržulj N. L-GRAAL: Lagrangian graphlet-based network aligner. Bioinformatics. 2015; 31(13):2182–9.

  5. Alberich R, Alcalà A, Llabrés M, Rosselló F, Valiente G. AligNet: Alignment of protein-protein interaction networks. arXiv e-prints. 2019; 11490:1902–07107.

  6. Li Z, Wang Y, Zhang S, Zhang X-S, Chen L. Alignment of protein interaction networks by integer quadratic programming. In: Proc. 28th IEEE EMBS Ann. Int. Conf. New York, NY: IEEE: 2006. p. 5527–30.

  7. Li Z, Zhang S, Wang Y, Zhang X-S, Chen L. Alignment of molecular networks by integer quadratic programming. Bioinformatics. 2007; 23(13):1631–9.

  8. Cook HV, Doncheva NT, Szklarczyk D, Mering CV, Jensen LJ. Viruses.STRING: A virus-host protein-protein interaction database. Viruses. 2018; 10(10):519.

  9. Fourer R, Gay DM, Kernighan BW. AMPL: a modeling language for mathematical programming, 2nd edn. Boston, Massachusetts: Cengage Learning; 2002.

  10. Clark C, Kalita J. A comparison of algorithms for the pairwise alignment of biological networks. Bioinformatics. 2014; 30(16):2351–9.

  11. Malod-Dognin N, Ban K, Pržulj N. Unified alignment of protein-protein interaction networks. Sci Rep. 2017; 7(1):953.

  12. Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol. 2017; 18(1):186.

  13. Baltimore D. Expression of animal virus genomes. Bacteriol Rev. 1971; 35(3):235–41.

  14. Paez-Espino D, Chen I-MA, Palaniappan K, et al. IMG/VR: A database of cultured and uncultured DNA viruses and retroviruses. Nucleic Acids Res. 2017; 45:457–65.

  15. Grazziotin AL, Koonin EV, Kristensen DM. Prokaryotic virus orthologous groups (pVOGs): A resource for comparative genomics and protein family annotation. Nucleic Acids Res. 2017; 45:491–8.

  16. Hulo C, Castro ED, Masson P, Bougueleret L, Bairoch A, Xenarios I, Mercier PL. ViralZone: A knowledge resource to understand virus diversity. Nucleic Acids Res. 2011; 39:576–82.

  17. Foulger RE, Osumi-Sutherland D, McIntosh BK, Hulo C, Masson P, Poux S, Mercier PL, Lomax J. Representing virus-host interactions and other multi-organism processes in the Gene Ontology. BMC Microbiol. 2015; 15:146.

  18. Kelley BP, Yuan B, Lewitter F, Sharan R, Stockwell BR, Ideker T. PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Res. 2004; 32:83–88.

  19. Glover F, Woolsey E. Further reduction of 0-1 polynomial programming problems to 0-1 linear programming problems. Oper Res. 1973; 21(1):156–61.

  20. Glover F, Woolsey E. Converting the 0-1 polynomial programming problem to a 0-1 linear program. Oper Res. 1974; 22(1):180–2.

  21. Kaufmann L, Broeckx F. An algorithm for the quadratic assignment problem using Benders’ decomposition. Eur J Oper Res. 1978; 2(3):207–11.

Acknowledgements

Not applicable.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 21 Supplement 6, 2020: Selected articles from the 15th International Symposium on Bioinformatics Research and Applications (ISBRA-19): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-21-supplement-6.

Funding

Publication costs are funded by Spanish Ministry of Economy and Competitiveness and European Regional Development Fund project PGC2018-096956-B-C43 (MINECO/FEDER).

Author information

Contributions

ML, FR and GV conceived and coordinated the study, performed data analysis and drafted the manuscript. GR and GV performed all the bioinformatic analyses. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Gabriel Valiente.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1

Supplementary materials (Tables S1–S26). (PDF 67.4 kb)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Llabrés, M., Riera, G., Rosselló, F. et al. Alignment of biological networks by integer linear programming: virus-host protein-protein interaction networks. BMC Bioinformatics 21 (Suppl 6), 434 (2020). https://doi.org/10.1186/s12859-020-03733-w

  • DOI: https://doi.org/10.1186/s12859-020-03733-w
