Abstract
Minimum flow decomposition (MFD)—the problem of finding a minimum set of paths that perfectly decomposes a flow—is a classical problem in Computer Science, and variants of it are powerful models in multiassembly problems in Bioinformatics (e.g. RNA assembly). However, because this problem and its variants are NP-hard, practical multiassembly tools either use heuristics or solve simpler, polynomial-time solvable versions of the problem, which may yield solutions that are not minimal or do not perfectly decompose the flow. Many RNA assemblers also use integer linear programming (ILP) formulations of such practical variants, having the major limitation they need to encode all the potentially exponentially many solution paths. Moreover, the only exact solver for MFD does not scale to large instances, and cannot be efficiently generalized to practical MFD variants.
In this work, we provide the first practical ILP formulation for MFD (and thus the first fast and exact solver for MFD), based on encoding all of the exponentially many solution paths using only a quadratic number of variables. On both simulated and real flow graphs, our approach runs in under 13 s on average. We also show that our ILP formulation can be easily and efficiently adapted for many practical variants, such as incorporating longer or paired-end reads, or minimizing flow errors.
We hope that our results can remove the current tradeoff between the complexity of a multiassembly model and its tractability, and can lie at the core of future practical RNA assembly tools. Our implementations are freely available at github.com/algbio/MFD-ILP.
This work was partially funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 851093, SAFEBIO), partially by the Academy of Finland (grants No. 322595, 328877, 308030), and partially by the US NSF (award 1759522). The full version of this work is available at [8].
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
We note that this version has one subtlety to address: as discussed below, it is necessary to linearize products in the formulation to make it a true ILP (or MILP, in this case). To linearize products of the real variables, it is required that the real variables have closed bounds. However, if we solve k-FD for increasing k (and not binary search), we can use \(w_i \ge 0\), since no weight 0 path will be included. This introduces the limitation that this formulation could not be used to solve flow decomposition for a fixed k, but only if k is an upper bound on the solution size.
- 2.
- 3.
The full dataset from [40] is available at https://zenodo.org/record/1460998. We use the file rnaseq/sparse_quant_SRR020730.graph.
References
Ahuja, R.K., et al.: Network Flows. Alfred P. Sloan School of Management, Cambridge (1988)
Amarasinghe, S.L., et al.: Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21(1), 1–16 (2020)
Baaijens, J.A., Stougie, L., Schönhuth, A.: Strain-aware assembly of genomes from mixed samples using flow variation graphs. In: Schwartz, R. (ed.) RECOMB 2020. LNCS, vol. 12074, pp. 221–222. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-45257-5_14
Bernard, E., et al.: Efficient RNA isoform identification and quantification from RNA-Seq data with network flows. Bioinformatics 30(17), 2447–2455 (2014)
Bixby, B.: The Gurobi optimizer. Transp. Res. Part B 41(2), 159–178 (2007)
Canzar, S., et al.: CIDANE: comprehensive isoform discovery and abundance estimation. Genome Biol. 17(1), 1–18 (2016)
Cohen, R., et al.: On the effect of forwarding table size on SDN network utilization. In: IEEE INFOCOM 2014-IEEE conference on computer communications, pp. 1734–1742. IEEE (2014)
Dias, F.H.C.: Fast, Flexible, and Exact Minimum Flow Decompositions via ILP. arXiv arXiv:2201.10923 (2022)
Furini, F., Traversi, E.: Theoretical and computational study of several linearisation techniques for binary quadratic problems. Ann. Oper. Res. 279(1), 387–411 (2019). https://doi.org/10.1007/s10479-018-3118-2
Gatter, T., Stadler, P.F.: Ryūtō: network-flow based transcriptome reconstruction. BMC Bioinf. 20(1), 1–14 (2019). https://doi.org/10.1186/s12859-019-2786-5
Gurobi Optimization, LLC: Gurobi Optimizer Reference Manual (2021). https://www.gurobi.com
Gusfield, D.: Integer Linear Programming in Computational and Systems Biology: An Entry-Level Text and Course. Cambridge University Press, New York (2019)
Hagemann-Jensen, M., et al.: Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nat. Biotechnol. 38(6), 708–714 (2020)
Hartman, T., et al.: How to split a flow? In: 2012 Proceedings IEEE INFOCOM, pp. 828–836. IEEE (2012)
Hong, C.Y., et al.: Achieving high utilization with software-driven wan. In: Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM, pp. 15–26 (2013)
Huang, K.K., et al.: Long-read transcriptome sequencing reveals abundant promoter diversity in distinct molecular subtypes of gastric cancer. Genome Biol. 22(1), 1–24 (2021). https://doi.org/10.1186/s13059-021-02261-x
IBM ILOG CPLEX Optimization Studio: CPLEX Users Manual, ver. 12.7 (2017)
Khan, S., et al.: Safety and Completeness in Flow Decompositions for RNA Assembly. arXiv arXiv:2201.10372 (2022)
Kim, D., et al.: Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37(8), 907–915 (2019)
Kim, P.M., et al.: Analysis of copy number variants and segmental duplications in the human genome: Evidence for a change in the process of formation in recent evolutionary history. Genome Res. 18(12), 1865–1874 (2008)
Kloster, K., et al.: A practical FPT algorithm for flow decomposition and transcript assembly. In: 2018 Proceedings of the Twentieth Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 75–86. SIAM (2018)
Kovaka, S., et al.: Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20(1), 1–13 (2019). https://doi.org/10.1186/s13059-019-1910-1
Li, W.: RNASeqReadSimulator: a simple RNA-seq read simulator (2014)
Li, W., et al.: IsoLasso: a LASSO regression approach to RNA-Seq based transcriptome assembly. J. Comput. Biol. 18(11), 1693–1707 (2011)
Liberti, L.: Compact linearization for binary quadratic problems. 4OR(3), 31–245 (2007)
Lin, Y.-Y., et al.: CLIIQ: accurate comparative detection and quantification of expressed isoforms in a population. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 178–189. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33122-0_14
Ma, C., et al.: Finding ranges of optimal transcript expression quantification in cases of non-identifiability. bioRxiv (2020). https://doi.org/10.1101/2019.12.13.875625, to appear at RECOMB 2021
Mangul, S., et al.: An integer programming approach to novel transcript reconstruction from paired-end RNA-Seq reads. In: Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine, pp. 369–376 (2012)
Mao, S., et al.: Refshannon: a genome-guided transcriptome assembler using sparse flow decomposition. PLoS One 15(6), e0232946 (2020)
Maretty, L., et al.: Bayesian transcriptome assembly. Genome Biol. 15(10), 1–11 (2014)
Mumey, B., Shahmohammadi, S., McManus, K., Yaw, S.: Parity balancing path flow decomposition and routing. In: 2015 IEEE Globecom Workshops (GC Wkshps), pp. 1–6. IEEE (2015)
Nagarajan, N., Pop, M.: Sequence assembly demystified. Nat. Rev. Genet. 14(3), 157–167 (2013)
Ohst, J.P.: On the Construction of Optimal Paths from Flows and the Analysis of Evacuation Scenarios. Ph.D. thesis, University of Koblenz and Landau, Germany (2015)
Olsen, N., et al.: A study on flow decomposition methods for scheduling of electric buses in public transport based on aggregated time–space network models. Cent. Eur. J. Oper. Res. (2020). https://doi.org/10.1007/s10100-020-00705-6
Patro, R., et al.: Salmon: accurate, versatile and ultrafast quantification from RNA-seq data using lightweight-alignment. BioRxiv, p. 021592 (2015)
Pertea, M., et al.: StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33(3), 290–295 (2015)
Safikhani, Z., et al.: SSP: an interval integer linear programming for de novo transcriptome assembly and isoform discovery of RNA-seq reads. Genomics 102(5–6), 507–514 (2013)
Shah, S.P., et al.: The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature 486(7403), 395–399 (2012)
Shao, M., Kingsford, C.: Accurate assembly of transcripts through phase-preserving graph decomposition. Nat. Biotechnol. 35(12), 1167–1169 (2017)
Shao, M., Kingsford, C.: Theory and a heuristic for the minimum path flow decomposition problem. IEEE/ACM Trans. Comput. Biol. Bioinf. 16(2), 658–670 (2017)
Stamm, S., et al.: Function of alternative splicing. Gene 344, 1–20 (2005)
Taccari, L.: Integer programming formulations for the elementary shortest path problem. Eur. J. Oper. Res. 252(1), 122–130 (2016)
Tomescu, A.I., et al.: A novel min-cost flow method for estimating transcript expression with RNA-Seq. BMC Bioinf. 14, S1:51-S1:51 (2013). https://doi.org/10.1186/1471-2105-14-S5-S15
Tomescu, A.I., et al.: Explaining a weighted DAG with few paths for solving genome-guided multi-assembly. IEEE/ACM Trans. Comput. Biol. Bioinf. 12(6), 1345–1354 (2015)
Töpfer, A., et al.: Probabilistic inference of viral quasispecies subject to recombination. J. Comput. Biol. 20(2), 113–123 (2013)
Trapnell, C., et al.: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnol. 28(5), 511–515 (2010)
Vatinlen, B., et al.: Simple bounds and greedy algorithms for decomposing a flow into a minimal set of paths. Eur. J. Oper. Res. 185(3), 1390–1401 (2008)
Vignuzzi, M., et al.: Quasispecies diversity determines pathogenesis through cooperative interactions in a viral population. Nature 439(7074), 344–348 (2006)
Voshall, A., Moriyama, E.N.: Next-generation transcriptome assembly: strategies and performance analysis. In: Bioinformatics in the Era of Post Genomics and Big Data, pp. 15–36 (2018)
Wächter, A., Biegler, L.T.: On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Math. Program. 106(1), 25–57 (2006). https://doi.org/10.1007/s10107-004-0559-y
Wang, E.T., et al.: Alternative isoform regulation in human tissue transcriptomes. Nature 456(7221), 470–476 (2008)
Westbrooks, K., Astrovskaya, I., Campo, D., Khudyakov, Y., Berman, P., Zelikovsky, A.: HCV Quasispecies assembly using network flows. In: Măndoiu, I., Sunderraman, R., Zelikovsky, A. (eds.) ISBRA 2008. LNCS, vol. 4983, pp. 159–170. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-79450-9_15
Williams, L., et al.: RNA transcript assembly using inexact flows. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1907–1914. IEEE (2019)
Williams, L., et al.: Flow decomposition with subpath constraints. In: 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2021)
Xing, Y., et al.: The multiassembly problem: reconstructing multiple transcript isoforms from EST fragment mixtures. Genome Res. 14(3), 426–441 (2004)
Zagordi, O., et al.: ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data. BMC Bioinf. 12(1), 1–5 (2011). https://doi.org/10.1186/1471-2105-12-119
Zhang, Q., et al.: Scallop2 enables accurate assembly of multiple-end RNA-seq data. bioRxiv (2021). https://doi.org/10.1101/2021.09.03.458862
Zhao, J., et al.: Multitrans: an algorithm for path extraction through mixed integer linear programming for transcriptome assembly. IEEE/ACM Trans. Comput. Biol. Bioinf. (2021). https://doi.org/10.1109/TCBB.2021.3083277
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Dias, F.H.C., Williams, L., Mumey, B., Tomescu, A.I. (2022). Fast, Flexible, and Exact Minimum Flow Decompositions via ILP. In: Pe'er, I. (eds) Research in Computational Molecular Biology. RECOMB 2022. Lecture Notes in Computer Science(), vol 13278. Springer, Cham. https://doi.org/10.1007/978-3-031-04749-7_14
Download citation
DOI: https://doi.org/10.1007/978-3-031-04749-7_14
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04748-0
Online ISBN: 978-3-031-04749-7
eBook Packages: Computer ScienceComputer Science (R0)