DNACoder: a CNN-LSTM attention-based network for genomic sequence data compression

K. S. Sheena¹ &
Madhu S. Nair¹

Abstract

Genomic sequencing has become increasingly prevalent and generates massive amounts of data, which poses a significant challenge for long-term storage and transmission. A solution is needed that reduces storage and transfer requirements without compromising data integrity. Neural networks have already proven effective in tasks such as image and speech compression. Adapting them to recognize the intricate patterns in genomic sequences could expose more redundancy and reduce storage requirements. We propose DNACoder, a novel lossless, reference-free compression approach that can also be used as a reference-based compressor. It uses a deep learning prediction model structured as a convolutional layer followed by an attention-based long short-term memory (LSTM) network, and it offers high-performance compression for genomic sequences in any data format. On the benchmarked dataset, the proposed method improves compression by 21.1% in bits per base compared to existing compressors. The experimental results on the tested data show that the CNN-LSTM model generalizes effectively to genomic sequence data and outperforms the state-of-the-art methods, which indicates its potential for genomic data storage.
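
The bits-per-base figure quoted above can be illustrated with a minimal sketch. It is not the DNACoder model: a simple adaptive order-k context model stands in for the CNN-LSTM predictor, and the function names, the fixed "ACGT" alphabet, and the Laplace smoothing are all assumptions made for illustration. The sketch only shows how a predictor's probabilities translate into an ideal lossless code length, which is the quantity an entropy coder driven by such a predictor would approach.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"  # assumption: plain four-letter DNA, no ambiguity codes


def bits_per_base(sequence: str, order: int = 2) -> float:
    """Ideal code length in bits per base under an adaptive order-k model.

    The context model is a stand-in for a learned predictor. Each base costs
    -log2(p) bits, where p is the probability the model assigned to it
    before seeing it, so a decoder could rebuild the same model step by step.
    """
    counts = defaultdict(Counter)  # context -> counts of the bases that followed
    total_bits = 0.0
    for i, base in enumerate(sequence):
        context = sequence[max(0, i - order):i]
        seen = counts[context]
        # Laplace smoothing keeps every symbol's probability above zero.
        p = (seen[base] + 1) / (sum(seen.values()) + len(ALPHABET))
        total_bits += -math.log2(p)
        seen[base] += 1  # update only after coding, as a decoder would
    return total_bits / len(sequence)


def relative_gain(baseline_bpb: float, new_bpb: float) -> float:
    """Fractional reduction in bits per base relative to a baseline."""
    return (baseline_bpb - new_bpb) / baseline_bpb


if __name__ == "__main__":
    toy = "ACGT" * 250  # highly repetitive toy input, not real genomic data
    print(f"order 0: {bits_per_base(toy, order=0):.3f} bits/base")
    print(f"order 2: {bits_per_base(toy, order=2):.3f} bits/base")
```

On this toy input the order-0 model stays near the 2 bits per base of a fixed two-bit encoding, while the order-2 model drops far below it because the context makes each next base predictable. A stronger predictor lowers the figure in the same way.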

Data availability statement

Data supporting this study is included within the article and/or supporting materials.

Acknowledgements

The first author is thankful to the Cochin University of Science and Technology for the financial support through the University-SRF fellowship (File Ref.No.Ac.C3/2021 dated 27/02/2023).

Author information

Authors and Affiliations

Artificial Intelligence & Computer Vision Lab, Department of Computer Science, Cochin University of Science and Technology, Kochi, Kerala, 682022, India
K. S. Sheena & Madhu S. Nair

Corresponding author

Correspondence to K. S. Sheena.

Ethics declarations

Conflict of interest

We declare that all the authors have approved the manuscript and agree with the submission to this journal. There is no conflict of interest to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sheena, K.S., Nair, M.S. DNACoder: a CNN-LSTM attention-based network for genomic sequence data compression. Neural Comput & Applic 36, 18363–18376 (2024). https://doi.org/10.1007/s00521-024-10130-4

Received: 10 December 2023
Accepted: 26 June 2024
Published: 25 July 2024
Issue Date: October 2024
DOI: https://doi.org/10.1007/s00521-024-10130-4
