Abstract
Genomic sequencing is becoming increasingly prevalent and generates massive amounts of data, which makes long-term storage and transmission a significant challenge. A solution is needed that reduces storage and transfer requirements without compromising data integrity. Neural networks have already proved effective in tasks such as image and speech compression, and adapting them to recognize the intricate patterns in genomic sequences could expose more redundancy and further reduce storage requirements. We propose DNACoder, a novel lossless, reference-free compression approach that can also be used as a reference-based compressor. It is built on a deep learning prediction model structured as a convolutional layer followed by an attention-based long short-term memory (LSTM) network, and it offers high-performance compression for genomic sequences in any data format. On the benchmarked dataset, DNACoder improves compression by 21.1% in bits per base compared with existing compressors. The experimental results on the tested data indicate that the CNN-LSTM model generalizes effectively to genomic sequence data and outperforms state-of-the-art methods.
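To make the bits-per-base metric concrete, the following is a minimal sketch of how a prediction-driven lossless compressor is typically scored. It is not the DNACoder model: a simple adaptive order-k count model stands in for the neural predictor, and the function names, the four-letter alphabet, and the smoothing scheme are assumptions made only for illustration. The idea it shows is that an entropy coder driven by a predictor spends roughly -log2 p bits on each base, so better predictions translate directly into fewer bits per base.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: input contains only these four bases


def ideal_bits_per_base(seq: str, order: int = 2) -> float:
    """Average ideal code length (bits/base) under an adaptive order-k model.

    Hypothetical stand-in for a learned predictor: each base is charged
    -log2 of the probability the model assigned to it before seeing it.
    """
    if not seq:
        raise ValueError("sequence must be non-empty")
    # counts[context][symbol], every symbol starts at 1 (Laplace smoothing)
    counts = defaultdict(lambda: {s: 1 for s in ALPHABET})
    total_bits = 0.0
    for i, base in enumerate(seq):
        context = seq[max(0, i - order):i]
        table = counts[context]
        p = table[base] / sum(table.values())
        total_bits += -math.log2(p)
        table[base] += 1  # update after coding so a decoder can mirror it
    return total_bits / len(seq)


def relative_improvement(baseline_bpb: float, new_bpb: float) -> float:
    """Percentage reduction in bits per base relative to a baseline."""
    return 100.0 * (baseline_bpb - new_bpb) / baseline_bpb


if __name__ == "__main__":
    repetitive = "ACGT" * 50
    print(f"{ideal_bits_per_base(repetitive):.3f} bits/base")
    # Illustrative numbers only, not results from the article.
    print(f"{relative_improvement(2.0, 1.5):.1f}% fewer bits per base")
```

A plain two-bit encoding of A, C, G, and T costs exactly 2 bits per base, so any predictor that scores below that on a given sequence has found redundancy in it. A percentage improvement such as the one reported above is a relative reduction of this kind against a baseline compressor.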
Data availability statement
Data supporting this study is included within the article and/or supporting materials.
Acknowledgements
The first author is thankful to the Cochin University of Science and Technology for the financial support through the University-SRF fellowship (File Ref.No.Ac.C3/2021 dated 27/02/2023).
Ethics declarations
Conflict of interest
We declare that all the authors have approved the manuscript and agree with the submission to this journal. There is no conflict of interest to declare.
About this article
Cite this article
Sheena, K.S., Nair, M.S. DNACoder: a CNN-LSTM attention-based network for genomic sequence data compression. Neural Comput & Applic 36, 18363–18376 (2024). https://doi.org/10.1007/s00521-024-10130-4
DOI: https://doi.org/10.1007/s00521-024-10130-4