Abstract
Deep learning techniques for text generation open up the possibility of generating fully simulated, or synthetic, genomes. The dataset of interest in this study is that of coronaviruses. Coronaviridae are a family of positive-sense RNA viruses capable of infecting humans and animals. These viruses usually cause mild to moderate upper respiratory tract infections; however, they can also cause more severe symptoms, as well as gastrointestinal and central nervous system diseases. Because the viruses adapt flexibly to new environments, the health threat from coronaviruses is constant and long-term. Immunogenic spike proteins are glycoproteins found on the surface of Coronaviridae particles that mediate entry into host cells. The aim of this study was to train deep learning neural networks to produce simulated spike protein sequences, which may aid understanding and/or vaccine design by providing alternative spike sequences that could arise from zoonotic sources in the future. Recurrent neural networks (RNNs) were trained to generate computer-simulated coronavirus spike protein sequences in the style of previously known sequences, and the characteristics of the generated sequences were examined. The deep generative model was built as a recurrent neural network employing text embedding and gated recurrent unit (GRU) layers in TensorFlow Keras. Training used a dataset of alpha, beta, gamma, and delta coronavirus spike sequences. In a set of 100 simulated sequences, all 100 had their most significant BLAST matches to spike proteins in searches against the NCBI non-redundant (NR) dataset and possessed the expected Pfam domain matches. Simulated sequences from the neural network may help to identify prospective targets for vaccine discovery in advance of a potential novel zoonosis.
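As a rough illustration of how protein sequences can be treated as text for a character-level recurrent model, the sketch below maps residues to integer ids (the form an embedding layer would consume) and builds input/target pairs in which the target is the input shifted by one residue. This is a minimal sketch under stated assumptions, not the study's actual preprocessing: the 20-letter alphabet, the padding id, the window length, the helper names, and the toy string are all hypothetical choices made for illustration.

```python
# Assumption: a plain 20-residue alphabet; the study's real vocabulary is not given in the abstract.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Reserve id 0 for padding, so residues map to ids 1..20.
CHAR_TO_ID = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
ID_TO_CHAR = {i: aa for aa, i in CHAR_TO_ID.items()}


def encode(sequence):
    """Map a protein string to a list of integer token ids."""
    return [CHAR_TO_ID[aa] for aa in sequence]


def decode(ids):
    """Map token ids back to a protein string, skipping padding (id 0)."""
    return "".join(ID_TO_CHAR[i] for i in ids if i != 0)


def make_training_pairs(sequence, window):
    """Slide a fixed-length window over the sequence.

    Each target is the corresponding input shifted right by one residue,
    which is the usual setup for next-character prediction.
    """
    ids = encode(sequence)
    pairs = []
    for start in range(len(ids) - window):
        inputs = ids[start:start + window]
        targets = ids[start + 1:start + window + 1]
        pairs.append((inputs, targets))
    return pairs


# Toy string used only for illustration; it is not drawn from the study's dataset.
toy = "MFVFLVLLPLVSSQ"
pairs = make_training_pairs(toy, window=5)
first_inputs, first_targets = pairs[0]
print(decode(first_inputs), "->", decode(first_targets))
```

In a setup like this, the integer ids would typically be passed to an embedding layer followed by one or more GRU layers, with generation proceeding one residue at a time from the predicted distribution.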
Data Availability
The model and source code are available at: https://github.com/LCrossman.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Crossman, L.C. (2022). Deep Recurrent Neural Networks for the Generation of Synthetic Coronavirus Spike Protein Sequences. In: Chicco, D., et al. Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2021. Lecture Notes in Computer Science, vol 13483. Springer, Cham. https://doi.org/10.1007/978-3-031-20837-9_17
DOI: https://doi.org/10.1007/978-3-031-20837-9_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20836-2
Online ISBN: 978-3-031-20837-9
eBook Packages: Computer Science, Computer Science (R0)