New03 Application of Transformers in Bioinfo
https://doi.org/10.1093/bioadv/vbad001
Advance Access Publication Date: 11 January 2023
Review
Abstract
Summary: Transformer-based language models, including the vanilla transformer, BERT and GPT-3, have achieved revolutionary breakthroughs in the field of natural language processing (NLP). Since there are inherent similarities between various biological sequences and natural languages, the remarkable interpretability and adaptability of these models have prompted a new wave of their application in bioinformatics research. To provide a timely and comprehensive review, we introduce key developments of transformer-based language models by describing the detailed structure of transformers and summarize their contribution to a wide range of bioinformatics research, from basic sequence analysis to drug discovery. While transformer-based applications in bioinformatics are diverse and multifaceted, we identify and discuss the common challenges, including heterogeneity of training data, computational expense and model interpretability, as well as opportunities in the context of bioinformatics research. We hope that the broader community of NLP researchers, bioinformaticians and biologists will be brought together to foster future research and development in transformer-based language models, and inspire novel bioinformatics applications that are unattainable by traditional methods.
Contact: wanwen@stanford.edu
Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Fig. 1. The focus of this review article and some classic language model frameworks. (A) Relationships of artificial intelligence, machine learning, natural language processing, transformer-based language models and bioinformatics. The blue square denotes the focal point of this review article. (B) Two common models in Word2Vec: CBOW (Continuous Bag-of-Words Model) and Skip-gram (Continuous Skip-gram Model). (C) The structure of the transformer model. (D) The structure of the BERT model
sequentially process all past states and compress contextual information into a bottleneck with long input sequences (Bengio et al., 1994; Pascanu et al., 2013). For example, Seq2Seq (Sutskever et al., 2014), the first encoder–decoder model in machine translation tasks, supports variable-length inputs and outputs but is still limited by its LSTM infrastructure. The Transformer (Vaswani et al., 2017) model was then developed by Google, which completely abandoned RNN-based network structures and only used the multi-head attention mechanism (Fig. 1C). Transformer does not rely on past hidden states to capture dependencies on previous words. Instead, it processes a sentence as a whole to allow for parallel computing and alleviates the vanishing gradient and per-

benefit the computer science community but also the broader community of bioinformaticians and biologists, and further provide insights for future bioinformatics research across multiple disciplines that are unattainable by traditional methods.

2 Basics of transformer-based language models

Language models are trained in a self-supervised fashion (Liu et al., 2023). Compared to supervised learning (Hastie et al., 2009), which usually needs human annotations, language models could use massive amounts of unannotated corpora from the internet, books, etc.
The outputs of the individual attention heads computed with parameters W^Q, W^K and W^V are concatenated, and once again projected with a parameter matrix W^O, resulting in the final values, as depicted as follows:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

FFN(x) = max(0, xW_1 + b_1)W_2 + b_2    (3)
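To make these two building blocks concrete, the following minimal NumPy sketch (an illustration written for this review's description, not code from any surveyed model; the matrix names and toy dimensions are assumptions) concatenates the per-head attention outputs, applies the final projection W^O, and then the position-wise feed-forward network of Equation (3).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # One projection matrix per head in W_Q / W_K / W_V; W_O is the final output projection.
    heads = [scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
             for Wq, Wk, Wv in zip(W_Q, W_K, W_V)]
    # Concatenate all heads, then project once more: Concat(head_1, ..., head_h) W_O.
    return np.concatenate(heads, axis=-1) @ W_O

def feed_forward(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2  (Equation 3)
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Toy setup: 4 tokens, model dimension 8, 2 heads of dimension 4, FFN inner dimension 16.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q = [rng.normal(size=(8, 4)) for _ in range(2)]
W_K = [rng.normal(size=(8, 4)) for _ in range(2)]
W_V = [rng.normal(size=(8, 4)) for _ in range(2)]
W_O = rng.normal(size=(8, 8))
attn_out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
ffn_out = feed_forward(attn_out, rng.normal(size=(8, 16)), np.zeros(16),
                       rng.normal(size=(16, 8)), np.zeros(8))
print(attn_out.shape, ffn_out.shape)  # (4, 8) (4, 8)
```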
(Zhang et al., 2018). Layer normalization can accelerate the training process of the model by normalizing the output of the former layers to make it converge faster (Ba et al., 2016).

2.4 Position encodings

Since transformer uses pure self-attention without recurrence or convolution to capture connections between tokens, it cannot identify the order of the tokens in the sequence. Therefore, transformer adds position encodings to the input embeddings (Liu et al., 2020) to reflect the absolute or relative position of the tokens in the sequence. The absolute position encoding informs the transformer architecture

This process will be repeated, appending the new output into the input sequence. To end the loop, an 'END' token is appended to the lexicon. The loop stops when the output token is 'END', resulting in the complete final output sequence. Because of the extra 'BEGIN' token, the decoder's input is shifted one position to the right (Fig. 4). It is worth mentioning that when generating an output token, the input sequence only contains the tokens before it. When passing through the first attention layer, the queries, values and keys after this token will be masked and will not participate in the attention calculation. The decoder's input in the current round, which is the
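Returning briefly to the absolute position encodings of Section 2.4, one concrete example is sketched below; it assumes the sinusoidal scheme of Vaswani et al. (2017), while other absolute or relative encodings surveyed by Liu et al. (2020) follow the same pattern of being combined with the token embeddings.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]            # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is added element-wise to the token embeddings before the first encoder layer.
token_embeddings = np.random.default_rng(0).normal(size=(10, 16))
inputs = token_embeddings + sinusoidal_position_encoding(10, 16)
print(inputs.shape)  # (10, 16)
```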
Fig. 3. The first step of the decoding process. The decoder predicts which token to output based on its own input and the output of the encoder. The decoder takes a special token 'BEGIN' as input and combines it with the encoder's output to generate an output vector whose length is the size of the lexicon, with each dimension corresponding to a certain token in the lexicon. A softmax function is then applied to this vector to produce a probability distribution, and the token in the lexicon with the highest probability is the corresponding output, which is also the first token in the final output sequence
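The generation loop described above and in Figures 3 and 4 can be summarized in the short Python sketch below; the toy lexicon, the decoder_step scoring function and the greedy (argmax) selection are illustrative placeholders rather than the decoding strategy of any particular model surveyed here.

```python
import numpy as np

LEXICON = ["BEGIN", "END", "A", "C", "G", "T"]  # toy lexicon with the two special tokens

def decoder_step(prefix_tokens, encoder_output):
    # Placeholder for one decoder pass: returns unnormalized scores (logits) over the lexicon.
    rng = np.random.default_rng(len(prefix_tokens))
    return rng.normal(size=len(LEXICON))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def greedy_decode(encoder_output, max_len=20):
    output = ["BEGIN"]                               # decoding starts from the special BEGIN token
    for _ in range(max_len):
        probs = softmax(decoder_step(output, encoder_output))
        next_token = LEXICON[int(np.argmax(probs))]  # pick the token with the highest probability
        if next_token == "END":                      # the END token stops the loop
            break
        output.append(next_token)                    # append the new output to the input sequence
    return output[1:]                                # drop BEGIN; the rest is the final output

print(greedy_decode(encoder_output=None))
```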
6 S.Zhang et al.
Fig. 5. Structure of the cross-attention layer. The encoder block in this figure refers to a certain block in the encoder whose output participates in cross-attention with the decoder. Masked self-attention refers to the first attention sub-layer in the decoder block. Ti (i = 1, 2, 3, 4) is the ith token's output of the encoder block shown in this figure and also the ith token's input of the next encoder block. Ki and Vi (i = 1, 2, 3, 4) are the key matrix and the value matrix of Ti. Q′1 is the corresponding query matrix of T′1, which is the first token's output of masked self-attention. Cross-attention uses the decoder's queries and the encoder's keys and values to calculate the attention function, and the output of cross-attention is fed into the feed-forward layer in the decoder block
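The two attention sub-layers of the decoder block in Figure 5 can be sketched as follows (NumPy only; the learned query/key/value projections are omitted and all inputs are random placeholders). The sketch shows the causal mask used in masked self-attention and the use of encoder keys and values in cross-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # masked positions get ~zero attention weight
    return softmax(scores) @ V

rng = np.random.default_rng(0)
dec_in = rng.normal(size=(3, 8))    # decoder input (3 tokens); projections omitted for brevity
enc_out = rng.normal(size=(4, 8))   # encoder output (4 tokens, T1..T4)

# 1) Masked self-attention: each decoder token attends only to itself and earlier tokens.
causal_mask = np.tril(np.ones((3, 3), dtype=bool))
self_out = attention(dec_in, dec_in, dec_in, mask=causal_mask)

# 2) Cross-attention: queries come from the masked self-attention output,
#    keys and values come from the encoder output.
cross_out = attention(self_out, enc_out, enc_out)
print(cross_out.shape)  # (3, 8) -- fed into the decoder's feed-forward layer
```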
Table 1. Summary and comparison of the representative applications of transformer-based language models in different fields of bioinformatics

Field | Paper | Pre-trained model? (Y/N) | Main focus | Data repositories address
Note: The papers are sorted by their appearance in this review and divided into different categories based on their research field.
3.1 Sequence analysis

Biological sequence analysis, including DNA, RNA and protein sequence analysis, represents one of the fundamental applications of computational methods in molecular biology. Traditional sequence analysis methods rely heavily on k-mer frequencies (Koonin and Galperin, 2003b), which are not able to capture distant semantic relationships of the gene regulatory code. Deep learning models like CNNs also have problems capturing semantic dependency within long-range contexts (Tang et al., 2018), as their capability to extract local features is limited by the filter size. The RNN-based models (e.g. LSTM and GRU) are developed to capture long-range dependency; however, it is difficult for them to perform large-scale learning due to their limited degree of parallelization. In addition, existing models generally require large amounts of labeled data, which is difficult to obtain in bioinformatics research (Butte, 2001).
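For reference, the k-mer representation mentioned above can be produced as follows; the overlapping 6-mer tokenization is an assumption borrowed from DNABERT-style preprocessing rather than a fixed standard, and the example sequence is arbitrary.

```python
from collections import Counter

def kmer_tokens(sequence, k=6):
    # Overlapping k-mers: every window of length k becomes one token.
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

seq = "ACGTACGTGGCATTACG"
tokens = kmer_tokens(seq, k=6)
frequencies = Counter(tokens)      # the k-mer frequency profile used by traditional methods
print(tokens[:3])                  # ['ACGTAC', 'CGTACG', 'GTACGT']
print(frequencies.most_common(2))
```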
Fig. 8. Several typical transformer-based models applied to bioinformatics, including the frameworks of (A) DNABERT, (B) TransEPI, (C) Enformer, (D) TALE, (E) Hist2ST and (F) ViT-V-Net
3.2 Genome analysis

Although sequence analysis contributes significantly to biological discovery, genome analysis is also essential to capture the full repertoire of information encoded in the genome (Koonin and Galperin, 2003a). Genome analysis explains the appearance of tumors or phenotypes from the DNA level, including gene mutations, deletions, amplifications (Feuk et al., 2006) and epigenetic modifications (e.g. DNA methylation) (Nikpay et al., 2015; Portela and Esteller, 2010). Several scratch-trained methods based on the Transformer model have been developed to this end. For example, Clauwaert et al. (2021) proposed a prokaryotic genome annotation method based on the Transformer-XL neural network framework, which was

prediction of mutation effects through direct mutation analysis and population eQTL studies (Liu et al., 2022).

In addition to predicting the effect of non-coding DNA on gene expression, transformer-based models have been widely used to predict cancer subtypes according to gene expression data. Gene transformer used the multi-headed self-attention module to solve the complexity of high-dimensional gene expression for joint classification of lung cancer subtypes (Khan and Lee, 2021). Compared with traditional classification algorithms, the proposed model achieved an overall performance improvement in all evaluation metrics, with 100% accuracy and zero false-negative rates on most datasets. scRNA-seq is a revolutionary technology in the life science field.
supervised training). The researchers trained two auto-regressive models (Transformer-XL and XLNet) and four auto-encoder models (BERT, ALBERT, ELECTRA and T5) on large-scale protein sequences and tested both residue-level (3-state accuracy: Q3 = 81–87%) and protein-level (10-state accuracy: Q10 = 81%; 2-state accuracy: Q2 = 91%) prediction tasks using the embeddings obtained from the language models above, and found that ProtT5 fine-tuned on UniRef50 without MSA outperformed ESM-1b and achieved the best performance.

Other transformer-based pre-trained models have also been widely used in proteomics research. ProteinBERT is a model specifically designed for proteins (Brandes et al., 2022). The pre-training

et al., 2018), SVM (Cortes and Vapnik, 1995) and MLP (Kothari and Oh, 1993), it provided a high degree of interpretation of the results, as the attention of GTN could identify potential targeting pathways and biomarkers, which is almost impossible to achieve with other models. DeepMAPS was a deep learning-based single-cell multi-omics data analysis platform that utilized the heterogeneous graph transformer framework to infer cell type-specific single-cell biological networks (Ma et al., 2021). DeepMAPS can include all cells and genes in a heterogeneous graph to infer cell–cell, gene–gene and cell–gene relationships simultaneously.
were trained on free text, this model was characterized by using the International Classification of Diseases (ICD) codes. After fine-tuning experiments on pancreatic cancer prediction and heart failure prediction in diabetic patients, Med-BERT was validated to generalize across different sizes of fine-tuning training samples, which can better meet disease prediction research with small training datasets. Another promising application based on biomedical text data is an ALBERT-based model called InferBERT to predict clinical events and infer the causality (Wang et al., 2021), which is a prerequisite for deployment in drug safety. As evaluated on two FDA Adverse Event Reporting System cases, the results showed that the number of causal factors identified by InferBERT for analgesics-related acute

experimental result was not state-of-the-art, ChemBERTa could scale the pre-training dataset well, with powerful downstream performance and a practical attention-based visualization modality. K-BERT (Wu et al., 2022) presented new pre-training strategies that allowed the model to extract molecular features directly from SMILES. The atomic feature prediction task enabled K-BERT to learn the initial atomic information that was extracted manually in graph-based approaches, the molecular feature prediction task enabled K-BERT to learn the molecular descriptor/fingerprint information that was extracted manually in descriptor-based approaches, and the contrastive learning task enabled K-BERT to better 'understand' SMILES by making the embeddings of different
there is heterogeneous information, including text, code, graphs, etc. To fully capture the information in these heterogeneous data, both in-depth data preprocessing and model adaptation may be needed. For instance, biological sequence and genomic feature information is generally textual, e.g. in FASTQ, BED and SRA formats. Such data can be directly fed to the transformer by word embedding or character embedding techniques (Chen et al., 2022b; Ji et al., 2021; Rives et al., 2021); patient visit information (including disease, medication and clinical records) is represented as sequences of codes, such as EHR and ICD, where the code sequences are mapped to vector sequences in the application (Li et al., 2020; Meng et al., 2021; Rasmy et al., 2021); the biomedical field involves images that are generally

4.3 Model interpretability

A common criticism of deep learning models is their lack of interpretability. However, model interpretability analysis is particularly vital when the dimension of the original features is very high. Especially in the field of bioinformatics, gaining insight from the model is critical, since having an interpretable model of a biological system may lead to hypotheses that can be validated experimentally. The self-attention mechanism in Transformer has notable advantages in this direction. For example, through the analysis of attention maps, DNABERT (Ji et al., 2021) could visualize important areas that contributed to model decision-making, thereby improving
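As an illustration of this kind of attention-map analysis, the sketch below uses the Hugging Face transformers API to expose per-layer attention weights from a pre-trained encoder; the DNABERT checkpoint name is an assumption, and any compatible encoder checkpoint could be substituted.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint name (DNABERT with 6-mer tokens); substitute any compatible encoder.
checkpoint = "zhihan1996/DNA_bert_6"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint, output_attentions=True)

# DNABERT-style input: overlapping 6-mers of "ACGTAGCAT", separated by spaces.
sequence = "ACGTAG CGTAGC GTAGCA TAGCAT"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions holds one (batch, heads, seq_len, seq_len) tensor per layer.
last_layer = outputs.attentions[-1][0]      # (heads, seq_len, seq_len)
avg_attention = last_layer.mean(dim=0)      # average over heads -> (seq_len, seq_len)
received = avg_attention.sum(dim=0)         # total attention each input position receives

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
top = torch.topk(received, k=3).indices.tolist()
print([tokens[i] for i in top])             # candidate high-importance regions to inspect
```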
Funding

This work was supported by the National Natural Science Foundation of China [62003178] and the National Key Research and Development Program of China [2021YFF1001000].

Conflict of Interest: none declared.

References

Adel,H. et al. (2018) Overview of character-based models for natural language processing. In: Gelbukh,A. (ed.) Computational Linguistics and Intelligent Text Processing.
Chen,J. et al. (2021a) TransUNet: transformers make strong encoders for medical image segmentation. arXiv, arXiv:2102.04306v1, https://arxiv.org/abs/2102.04306v1, preprint: not peer reviewed.
Chen,J. et al. (2021b) ViT-V-Net: vision transformer for unsupervised volumetric medical image registration. arXiv, arXiv:2104.06468v1, https://arxiv.org/abs/2104.06468v1, preprint: not peer reviewed.
Chen,K. et al. (2022b) Capturing large genomic contexts for accurately predicting enhancer-promoter interactions. Brief. Bioinform., 23, bbab577.
Chen,Y.-C. et al. (2020) UNITER: UNiversal Image-TExt Representation learning. In: Vedaldi,A. et al. (ed.) Computer Vision – ECCV 2020, Lecture Notes in Computer Science. Springer International Publishing, Cham, pp. 104–120.
Howard,J. and Ruder,S. (2018) Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, pp. 328–339.
Huang,K. et al. (2021) MolTrans: molecular interaction transformer for drug-target interaction prediction. Bioinformatics, 37, 830–836.
Iuchi,H. et al. (2021) Representation learning applications in biological sequence analysis. Comput. Struct. Biotechnol. J., 19, 3198–3208.
Ji,Y. et al. (2021) DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 37, 2112–2120.
Lee,J. et al. (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36, 1234–1240.
Lee,K. et al. (2021) A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information. Brief. Bioinform., 22, bbab005.
Li,H. et al. (2022) KPGT: knowledge-guided pre-training of graph transformer for molecular property prediction. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '22. Association for Computing Machinery, New York, NY, USA, pp. 857–867.
Li,Y. et al. (2020) BEHRT: transformer for electronic health records. Sci. Rep., 10, 7155.
Lin,T. et al. (2022) A survey of transformers. AI Open, 3, 111–132.
Pascanu,R. et al. (2013) On the difficulty of training recurrent neural networks. In: Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML'13. JMLR.org, Atlanta, USA, pp. III-1310–III-1318.
Petroni,F. et al. (2019) Language models as knowledge bases? In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pp. 2463–2473.
Ponting,C.P. and Birney,E. (2005) Protein sequence analysis and domain identification. In: Walker,J.M. (ed.) The Proteomics Protocols Handbook. Springer Protocols Handbooks, Humana Press, Totowa, NJ, pp. 527–541.
Portela,A. and Esteller,M. (2010) Epigenetic modifications and human disease. Nat. Biotechnol., 28, 1057–1068.
Tang,G. et al. (2018) Why self-attention? A targeted evaluation of neural machine translation architectures. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, pp. 4263–4272.
Tao,Y. et al. (2020) From genome to phenome: predicting multiple cancer phenotypes based on somatic genomic alterations via the genomic impact transformer. In: Pacific Symposium on Biocomputing. Vol. 25, Big Island of Hawaii, USA, pp. 79–90.
Tsujii,J. (2021) Natural language processing and computational linguistics. Comput. Linguist., 47, 707–727.
Turian,J. et al. (2010) Word representations: a simple and general method for semi-supervised learning. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.
Zhang,L. et al. (2021b) BERT-m7G: a transformer architecture based on BERT and stacking ensemble to identify RNA N7-Methylguanosine sites from sequence information. Comput. Math. Methods Med., 2021, 7764764.
Zhang,Q. et al. (2017) A review on entity relation extraction. In: 2017 Second International Conference on Mechanical, Control and Computer Engineering (ICMCCE), Harbin, China, pp. 178–183.
Zhang,S. et al. (2020) TensorCoder: dimension-wise attention via tensor representation for natural language modeling. arXiv, arXiv:2008.01547v2, https://arxiv.org/abs/2008.01547v2, preprint: not peer reviewed.
Zhang,Z. et al. (2019) ERNIE: enhanced language representation with informative entities. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, pp. 1441–1451.
Zhao,G. et al. (2019) Explicit sparse transformer: concentrated attention through explicit selection. arXiv, arXiv:1912.11637v1, https://arxiv.org/abs/1912.11637v1, preprint: not peer reviewed.
Zheng,R. et al. (2021) Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation. In: Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual, pp. 12736–12746.
Zitnik,M. et al. (2018) Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics, 34, i457–i466.