DNACoder: a CNN-LSTM attention-based network for genomic sequence data compression

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

Genomic sequencing has become increasingly prevalent and generates massive amounts of data, which poses a significant challenge for long-term storage and transmission. A solution that reduces storage and transfer requirements without compromising data integrity is needed. Neural networks have already proven effective in tasks such as image and speech compression. Adapting them to recognize the intricate patterns in genomic sequences could help to find more redundancies and reduce storage requirements. The proposed method, called DNACoder, uses deep learning to achieve significant compression ratios while preserving the information in the genomic data, and it offers high-performance compression for genomic sequences in any data format. Using a deep learning prediction model structured as a convolutional layer followed by an attention-based long short-term memory network, we propose a novel lossless, reference-free compression approach (DNACoder) that can also be used as a reference-based compressor. On the benchmark dataset, the proposed method improves compression by 21.1% in bits per base compared to existing compressors. The experimental results on the tested data show that the CNN-LSTM model of the proposed algorithm generalizes effectively to genomic sequence data and outperforms state-of-the-art methods.
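
To make the bits-per-base metric and the idea of prediction-based lossless compression concrete, here is a minimal Python sketch. It is not the DNACoder model: a simple adaptive order-k frequency model stands in for the CNN-LSTM predictor. It rests on three assumptions that do not come from the article: the input contains only the bases A, C, G and T; an arithmetic coder driven by the predictor spends about -log2(p) bits on a base predicted with probability p; and the percentage improvement is read as a relative reduction in bits per base. The function names are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: no ambiguity codes such as N


def ideal_bits_per_base(seq: str, order: int = 2) -> float:
    """Ideal code length, in bits per base, under an adaptive context model.

    The model estimates P(next base | previous `order` bases) from the counts
    seen so far, with add-one smoothing. It is a stand-in for a learned
    predictor, not the network described in the abstract.
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    # Every context starts with a count of 1 per base (uniform prior).
    counts = defaultdict(lambda: {base: 1 for base in ALPHABET})
    total_bits = 0.0
    for i, base in enumerate(seq):
        context = seq[max(0, i - order):i]
        table = counts[context]
        p = table[base] / sum(table.values())
        total_bits += -math.log2(p)  # ideal arithmetic-coding cost
        # Update only after coding, so a decoder can mirror the same state.
        table[base] += 1
    return total_bits / len(seq)


def relative_improvement(baseline_bpb: float, new_bpb: float) -> float:
    """Percentage reduction in bits per base relative to a baseline."""
    return 100.0 * (baseline_bpb - new_bpb) / baseline_bpb


if __name__ == "__main__":
    repetitive = "ACGT" * 200
    print(f"{ideal_bits_per_base(repetitive):.3f} bits per base")
    # Hypothetical figures for illustration only, not results from the article.
    print(f"{relative_improvement(2.0, 1.578):.1f}% improvement")
```

A plain two-bit encoding of A, C, G and T costs 2 bits per base, so any predictor that finds structure in the sequence pushes the figure below 2. A stronger predictor assigns higher probability to the base that actually occurs and lowers the figure further.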

Data availability statement

Data supporting this study is included within the article and/or supporting materials.

Acknowledgements

The first author is thankful to the Cochin University of Science and Technology for the financial support through the University-SRF fellowship (File Ref.No.Ac.C3/2021 dated 27/02/2023).

Author information

Corresponding author

Correspondence to K. S. Sheena.

Ethics declarations

Conflict of interest

We declare that all the authors have approved the manuscript and agree with the submission to this journal. There is no conflict of interest to declare.

About this article

Cite this article

Sheena, K.S., Nair, M.S. DNACoder: a CNN-LSTM attention-based network for genomic sequence data compression. Neural Comput & Applic 36, 18363–18376 (2024). https://doi.org/10.1007/s00521-024-10130-4

  • DOI: https://doi.org/10.1007/s00521-024-10130-4
