Nothing Special   »   [go: up one dir, main page]

Skip to main content

Fast, Flexible, and Exact Minimum Flow Decompositions via ILP

  • Conference paper
  • First Online:
Research in Computational Molecular Biology (RECOMB 2022)

Abstract

Minimum flow decomposition (MFD)—the problem of finding a minimum set of paths that perfectly decomposes a flow—is a classical problem in Computer Science, and variants of it are powerful models in multiassembly problems in Bioinformatics (e.g. RNA assembly). However, because this problem and its variants are NP-hard, practical multiassembly tools either use heuristics or solve simpler, polynomial-time solvable versions of the problem, which may yield solutions that are not minimal or do not perfectly decompose the flow. Many RNA assemblers also use integer linear programming (ILP) formulations of such practical variants, having the major limitation they need to encode all the potentially exponentially many solution paths. Moreover, the only exact solver for MFD does not scale to large instances, and cannot be efficiently generalized to practical MFD variants.

In this work, we provide the first practical ILP formulation for MFD (and thus the first fast and exact solver for MFD), based on encoding all of the exponentially many solution paths using only a quadratic number of variables. On both simulated and real flow graphs, our approach runs in under 13 s on average. We also show that our ILP formulation can be easily and efficiently adapted for many practical variants, such as incorporating longer or paired-end reads, or minimizing flow errors.

We hope that our results can remove the current tradeoff between the complexity of a multiassembly model and its tractability, and can lie at the core of future practical RNA assembly tools. Our implementations are freely available at github.com/algbio/MFD-ILP.

This work was partially funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 851093, SAFEBIO), partially by the Academy of Finland (grants No. 322595, 328877, 308030), and partially by the US NSF (award 1759522). The full version of this work is available at [8].

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 64.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 84.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

Notes

  1. 1.

    We note that this version has one subtlety to address: as discussed below, it is necessary to linearize products in the formulation to make it a true ILP (or MILP, in this case). To linearize products of the real variables, it is required that the real variables have closed bounds. However, if we solve k-FD for increasing k (and not binary search), we can use \(w_i \ge 0\), since no weight 0 path will be included. This introduces the limitation that this formulation could not be used to solve flow decomposition for a fixed k, but only if k is an upper bound on the solution size.

  2. 2.

    We refer the reader to the full version of this paper [8] for experiments on Problem 3 (MIFD) in comparison with \(\mathsf {IFDSolver}\), which is an implementation of a heuristic algorithm for MIFD by [53].

  3. 3.

    The full dataset from [40] is available at https://zenodo.org/record/1460998. We use the file rnaseq/sparse_quant_SRR020730.graph.

References

  1. Ahuja, R.K., et al.: Network Flows. Alfred P. Sloan School of Management, Cambridge (1988)

    Book  Google Scholar 

  2. Amarasinghe, S.L., et al.: Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21(1), 1–16 (2020)

    Article  Google Scholar 

  3. Baaijens, J.A., Stougie, L., Schönhuth, A.: Strain-aware assembly of genomes from mixed samples using flow variation graphs. In: Schwartz, R. (ed.) RECOMB 2020. LNCS, vol. 12074, pp. 221–222. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-45257-5_14

    Chapter  Google Scholar 

  4. Bernard, E., et al.: Efficient RNA isoform identification and quantification from RNA-Seq data with network flows. Bioinformatics 30(17), 2447–2455 (2014)

    Article  Google Scholar 

  5. Bixby, B.: The Gurobi optimizer. Transp. Res. Part B 41(2), 159–178 (2007)

    Article  Google Scholar 

  6. Canzar, S., et al.: CIDANE: comprehensive isoform discovery and abundance estimation. Genome Biol. 17(1), 1–18 (2016)

    Article  Google Scholar 

  7. Cohen, R., et al.: On the effect of forwarding table size on SDN network utilization. In: IEEE INFOCOM 2014-IEEE conference on computer communications, pp. 1734–1742. IEEE (2014)

    Google Scholar 

  8. Dias, F.H.C.: Fast, Flexible, and Exact Minimum Flow Decompositions via ILP. arXiv arXiv:2201.10923 (2022)

  9. Furini, F., Traversi, E.: Theoretical and computational study of several linearisation techniques for binary quadratic problems. Ann. Oper. Res. 279(1), 387–411 (2019). https://doi.org/10.1007/s10479-018-3118-2

    Article  MathSciNet  MATH  Google Scholar 

  10. Gatter, T., Stadler, P.F.: Ryūtō: network-flow based transcriptome reconstruction. BMC Bioinf. 20(1), 1–14 (2019). https://doi.org/10.1186/s12859-019-2786-5

    Article  Google Scholar 

  11. Gurobi Optimization, LLC: Gurobi Optimizer Reference Manual (2021). https://www.gurobi.com

  12. Gusfield, D.: Integer Linear Programming in Computational and Systems Biology: An Entry-Level Text and Course. Cambridge University Press, New York (2019)

    Google Scholar 

  13. Hagemann-Jensen, M., et al.: Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nat. Biotechnol. 38(6), 708–714 (2020)

    Article  Google Scholar 

  14. Hartman, T., et al.: How to split a flow? In: 2012 Proceedings IEEE INFOCOM, pp. 828–836. IEEE (2012)

    Google Scholar 

  15. Hong, C.Y., et al.: Achieving high utilization with software-driven wan. In: Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM, pp. 15–26 (2013)

    Google Scholar 

  16. Huang, K.K., et al.: Long-read transcriptome sequencing reveals abundant promoter diversity in distinct molecular subtypes of gastric cancer. Genome Biol. 22(1), 1–24 (2021). https://doi.org/10.1186/s13059-021-02261-x

    Article  Google Scholar 

  17. IBM ILOG CPLEX Optimization Studio: CPLEX Users Manual, ver. 12.7 (2017)

    Google Scholar 

  18. Khan, S., et al.: Safety and Completeness in Flow Decompositions for RNA Assembly. arXiv arXiv:2201.10372 (2022)

  19. Kim, D., et al.: Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37(8), 907–915 (2019)

    Article  Google Scholar 

  20. Kim, P.M., et al.: Analysis of copy number variants and segmental duplications in the human genome: Evidence for a change in the process of formation in recent evolutionary history. Genome Res. 18(12), 1865–1874 (2008)

    Article  Google Scholar 

  21. Kloster, K., et al.: A practical FPT algorithm for flow decomposition and transcript assembly. In: 2018 Proceedings of the Twentieth Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 75–86. SIAM (2018)

    Google Scholar 

  22. Kovaka, S., et al.: Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20(1), 1–13 (2019). https://doi.org/10.1186/s13059-019-1910-1

    Article  Google Scholar 

  23. Li, W.: RNASeqReadSimulator: a simple RNA-seq read simulator (2014)

    Google Scholar 

  24. Li, W., et al.: IsoLasso: a LASSO regression approach to RNA-Seq based transcriptome assembly. J. Comput. Biol. 18(11), 1693–1707 (2011)

    Article  MathSciNet  Google Scholar 

  25. Liberti, L.: Compact linearization for binary quadratic problems. 4OR(3), 31–245 (2007)

    Google Scholar 

  26. Lin, Y.-Y., et al.: CLIIQ: accurate comparative detection and quantification of expressed isoforms in a population. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 178–189. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33122-0_14

    Chapter  Google Scholar 

  27. Ma, C., et al.: Finding ranges of optimal transcript expression quantification in cases of non-identifiability. bioRxiv (2020). https://doi.org/10.1101/2019.12.13.875625, to appear at RECOMB 2021

  28. Mangul, S., et al.: An integer programming approach to novel transcript reconstruction from paired-end RNA-Seq reads. In: Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine, pp. 369–376 (2012)

    Google Scholar 

  29. Mao, S., et al.: Refshannon: a genome-guided transcriptome assembler using sparse flow decomposition. PLoS One 15(6), e0232946 (2020)

    Google Scholar 

  30. Maretty, L., et al.: Bayesian transcriptome assembly. Genome Biol. 15(10), 1–11 (2014)

    Google Scholar 

  31. Mumey, B., Shahmohammadi, S., McManus, K., Yaw, S.: Parity balancing path flow decomposition and routing. In: 2015 IEEE Globecom Workshops (GC Wkshps), pp. 1–6. IEEE (2015)

    Google Scholar 

  32. Nagarajan, N., Pop, M.: Sequence assembly demystified. Nat. Rev. Genet. 14(3), 157–167 (2013)

    Google Scholar 

  33. Ohst, J.P.: On the Construction of Optimal Paths from Flows and the Analysis of Evacuation Scenarios. Ph.D. thesis, University of Koblenz and Landau, Germany (2015)

    Google Scholar 

  34. Olsen, N., et al.: A study on flow decomposition methods for scheduling of electric buses in public transport based on aggregated time–space network models. Cent. Eur. J. Oper. Res. (2020). https://doi.org/10.1007/s10100-020-00705-6

  35. Patro, R., et al.: Salmon: accurate, versatile and ultrafast quantification from RNA-seq data using lightweight-alignment. BioRxiv, p. 021592 (2015)

    Google Scholar 

  36. Pertea, M., et al.: StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33(3), 290–295 (2015)

    Google Scholar 

  37. Safikhani, Z., et al.: SSP: an interval integer linear programming for de novo transcriptome assembly and isoform discovery of RNA-seq reads. Genomics 102(5–6), 507–514 (2013)

    Article  Google Scholar 

  38. Shah, S.P., et al.: The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature 486(7403), 395–399 (2012)

    Article  Google Scholar 

  39. Shao, M., Kingsford, C.: Accurate assembly of transcripts through phase-preserving graph decomposition. Nat. Biotechnol. 35(12), 1167–1169 (2017)

    Article  Google Scholar 

  40. Shao, M., Kingsford, C.: Theory and a heuristic for the minimum path flow decomposition problem. IEEE/ACM Trans. Comput. Biol. Bioinf. 16(2), 658–670 (2017)

    Article  Google Scholar 

  41. Stamm, S., et al.: Function of alternative splicing. Gene 344, 1–20 (2005)

    Article  Google Scholar 

  42. Taccari, L.: Integer programming formulations for the elementary shortest path problem. Eur. J. Oper. Res. 252(1), 122–130 (2016)

    Article  MathSciNet  Google Scholar 

  43. Tomescu, A.I., et al.: A novel min-cost flow method for estimating transcript expression with RNA-Seq. BMC Bioinf. 14, S1:51-S1:51 (2013). https://doi.org/10.1186/1471-2105-14-S5-S15

  44. Tomescu, A.I., et al.: Explaining a weighted DAG with few paths for solving genome-guided multi-assembly. IEEE/ACM Trans. Comput. Biol. Bioinf. 12(6), 1345–1354 (2015)

    Article  Google Scholar 

  45. Töpfer, A., et al.: Probabilistic inference of viral quasispecies subject to recombination. J. Comput. Biol. 20(2), 113–123 (2013)

    Article  MathSciNet  Google Scholar 

  46. Trapnell, C., et al.: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnol. 28(5), 511–515 (2010)

    Article  Google Scholar 

  47. Vatinlen, B., et al.: Simple bounds and greedy algorithms for decomposing a flow into a minimal set of paths. Eur. J. Oper. Res. 185(3), 1390–1401 (2008)

    Article  MathSciNet  Google Scholar 

  48. Vignuzzi, M., et al.: Quasispecies diversity determines pathogenesis through cooperative interactions in a viral population. Nature 439(7074), 344–348 (2006)

    Article  Google Scholar 

  49. Voshall, A., Moriyama, E.N.: Next-generation transcriptome assembly: strategies and performance analysis. In: Bioinformatics in the Era of Post Genomics and Big Data, pp. 15–36 (2018)

    Google Scholar 

  50. Wächter, A., Biegler, L.T.: On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Math. Program. 106(1), 25–57 (2006). https://doi.org/10.1007/s10107-004-0559-y

    Article  MathSciNet  MATH  Google Scholar 

  51. Wang, E.T., et al.: Alternative isoform regulation in human tissue transcriptomes. Nature 456(7221), 470–476 (2008)

    Article  Google Scholar 

  52. Westbrooks, K., Astrovskaya, I., Campo, D., Khudyakov, Y., Berman, P., Zelikovsky, A.: HCV Quasispecies assembly using network flows. In: Măndoiu, I., Sunderraman, R., Zelikovsky, A. (eds.) ISBRA 2008. LNCS, vol. 4983, pp. 159–170. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-79450-9_15

  53. Williams, L., et al.: RNA transcript assembly using inexact flows. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1907–1914. IEEE (2019)

    Google Scholar 

  54. Williams, L., et al.: Flow decomposition with subpath constraints. In: 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2021)

    Google Scholar 

  55. Xing, Y., et al.: The multiassembly problem: reconstructing multiple transcript isoforms from EST fragment mixtures. Genome Res. 14(3), 426–441 (2004)

    Article  Google Scholar 

  56. Zagordi, O., et al.: ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data. BMC Bioinf. 12(1), 1–5 (2011). https://doi.org/10.1186/1471-2105-12-119

    Article  Google Scholar 

  57. Zhang, Q., et al.: Scallop2 enables accurate assembly of multiple-end RNA-seq data. bioRxiv (2021). https://doi.org/10.1101/2021.09.03.458862

  58. Zhao, J., et al.: Multitrans: an algorithm for path extraction through mixed integer linear programming for transcriptome assembly. IEEE/ACM Trans. Comput. Biol. Bioinf. (2021). https://doi.org/10.1109/TCBB.2021.3083277

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Alexandru I. Tomescu .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Dias, F.H.C., Williams, L., Mumey, B., Tomescu, A.I. (2022). Fast, Flexible, and Exact Minimum Flow Decompositions via ILP. In: Pe'er, I. (eds) Research in Computational Molecular Biology. RECOMB 2022. Lecture Notes in Computer Science(), vol 13278. Springer, Cham. https://doi.org/10.1007/978-3-031-04749-7_14

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-04749-7_14

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-04748-0

  • Online ISBN: 978-3-031-04749-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics