A Comprehensive Survey of Deep Learning Techniques in Protein Function Prediction

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 20, NO. 3, MAY/JUNE 2023 2291

Richa Dhanuka, Jyoti Prakash Singh, Senior Member, IEEE, and Anushree Tripathi, Member, IEEE

(Survey/Tutorial Paper)

Abstract: Protein function prediction is a major challenge in the field of bioinformatics which aims at predicting the functions performed by a known protein. Many protein data forms like protein sequences, protein structures, protein-protein interaction networks, and micro-array data representations are being used to predict functions. During the past few decades, abundant protein sequence data has been generated using high-throughput techniques, making sequences a suitable candidate for predicting protein functions using deep learning techniques. Many such advanced techniques have been proposed so far. It becomes necessary to consolidate all these works in a survey to provide a systematic view of all the techniques along with the chronology in which the techniques have advanced. This survey provides comprehensive details of the latest methodologies, their pros and cons as well as predictive accuracy, and a new direction, the interpretability of predictive models, that protein function prediction systems need to explore.

Index Terms: Bioinformatics, deep learning, feature extraction, protein function prediction, protein sequence.

Manuscript received 9 July 2022; revised 29 December 2022; accepted 16 February 2023. Date of publication 22 February 2023; date of current version 5 June 2023. (Corresponding author: Richa Dhanuka.)
The authors are with the Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar 800005, India (e-mail: richa.dhanuka@gmail.com; jps@nitp.ac.in; anushree.cs@nitp.ac.in).
Digital Object Identifier 10.1109/TCBB.2023.3247634
1545-5963 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: NUST School of Electrical Engineering and Computer Science (SEECS). Downloaded on July 18,2023 at 11:07:53 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

THE value of proteins for an organism is substantial due to their involvement in various crucial functions like metabolism, reproduction, etc., inside an organism. Knowledge about their functions helps in the discovery of new potential drugs, treatment of diseases, growth of better crops, and even the development of biofuels. The success of next-generation sequencing techniques caused the number of novel proteins to explode, creating an enormous gap between the number of sequenced proteins and their corresponding annotations, as protein sequences are abundantly available. Protein function prediction (PFP) has become a major challenge in the field of bioinformatics due to the abundance of proteins and the lack of knowledge about their corresponding functionalities. This knowledge gap can be attributed to complex, labor-intensive, and expensive laboratory procedures, which led to the entry of computational approaches for predicting protein functions. The research community has dedicated the last few decades to this direction, and many remarkable approaches and improvements have been developed so far. Fig. 1 shows the categorisation of various techniques being developed to predict protein functions. These include analysis of various types of protein data such as protein sequences [1], [2], protein-protein interaction data [3], [4], [5], protein structures [6], [7], gene expression data [8] and hybrid data [9], [10]. In comparison to other protein data sources, protein sequences have the advantage of being readily accessible, thus making them an ideal choice for identifying the functions of proteins. This survey focusses on protein sequence based PFP techniques.

Fig. 1. Major techniques of protein function prediction.

Initial developments for PFP were based on homology-based transfer, where the functions of a known protein were transferred to a query protein if the two proteins shared significant sequence similarities. However, with the explosion of protein data, homology-based methods started diminishing in utility due to the increasing number of distant and orphan proteins. To compensate, many computational approaches are being developed using sequences, structures, protein-protein interactions, biochemical and physicochemical properties, protein domains, and motifs that target non-homologous and distant proteins for function predictions. In the last few years, most of the developments in the field of PFP have used data-driven machine learning techniques, especially deep learning methods. This comprehensive survey reviews recent advancements in PFP, along with their contributions and outstanding challenges. The existing reviews that we found are mainly focussed on hand-crafted features [11], [12] and machine learning algorithms [13], [14]. This review paper seems to be timely, as none of the other reviews consolidates the deep learning-based approaches for predicting protein functions. This survey also assesses the benefits and drawbacks of various feature representations and different function prediction systems. The focus of all the discussed prediction systems is to maximize the predictive accuracy of the system. The black-box training of the system does not provide any insight into the data and the predictions associated with it, which undermines the confidence of biologists. In this survey, we give an insight into the importance of interpretability of such predictive models.

The rest of the paper is organized as follows: Section II gives a brief discussion of homology-based methods and details their shortcomings, which led to the development of new approaches, i.e., the model-based approach, which is discussed in detail in Section III. Section IV discusses the integration of homology-based methods and model-based methods in the form of hybrid methods that utilize the advantages of both methods.
Section V discusses the models that consider protein functional interrelationships to improve the predictability of the models. Section VI discusses the importance of interpretability of the predictive models. Lastly, Section VII concludes this review and suggests future directions.

II. HOMOLOGY-BASED METHODS

The earliest approach to protein function prediction is homology-based, in which the functions of a homologous protein are transferred to the query protein. A homologous protein is identified based on the similarity of a known protein (whose function was experimentally annotated) with the query protein. The similarity is defined using a similarity score sim_i, where i = 1, 2, 3, ..., n and n is the number of known proteins in the database. When the system finds a known protein with a high similarity score to the query protein, the functions of that known protein are transferred to the query protein.

The Smith-Waterman algorithm and BLAST (Basic Local Alignment Search Tool) [15], [16] are common examples of this approach. The E-value calculated by BLAST considers insertions, deletions, matches, and mismatches in the two comparing
proteins and provides an insight into the extent of similarity between two proteins. A low E-value indicates high similarity between the two protein sequences, and the functions of the known protein are transferred to the query protein.

One important advantage of these methods is the simplicity of the training phase. The training phase consists of storing known proteins along with their functions. The testing phase carries out the learning: it identifies the most similar protein sequence to the query sequence and transfers the annotation from the former to the latter. As and when a new known protein sequence arrives, it is added to the database so that query sequences can be tested against it. The accuracy and complexity of these methods depend on the size of the training dataset.

Although these methods are simple and advantageous, they have a few limitations that necessitated the development of other computational approaches. First, two proteins with similar sequences do not necessarily perform similar functions, and proteins that perform similar functions do not necessarily have the same sequences [17], [18]. In [19], the researcher showed that enzymatic functions are not necessarily conserved even at high sequence similarity. Second, these methods do not take protein functions into account while computing similarity scores between proteins. Hence, they do not consider domain similarity, which is an important indicator of function-function similarity. Rather, they compare the whole sequence, which has more fillers than meaningful information. In such a case, a sequence with high similarity should not be considered for transferring functions to the target sequence [18], [20], [21]. Third, it is difficult to annotate functions to those proteins for which no homologous sequence is found in the database as per the expected E-value cut-off. For example, when the A. thaliana genome was sequenced, 30% of the sequence was not assigned any function as there was no similar sequence in the database [22]. Fourth, homology-based transfer of protein functions does not consider the biochemical properties of the sequence, which have a significant relationship with protein functions [23]. Finally, the high time complexity of the testing phase, in which the test sequence is compared to each sequence in the database, opens up a new space for predicting functions. All these limitations motivate the search for a new arena for predicting protein functions. The next section discusses the model-based approaches for protein function prediction that overcome the limitations imposed by homology-based methods.

III. DEEP LEARNING-BASED METHODS

Machine learning algorithms learn abstract patterns from the training data, creating a model that can classify test data. In contrast to the earlier homology-based approach, the high time complexity of the testing phase shifts to the training phase in this approach. The training phase creates a classification model from the training data which can later be used directly to classify new, unseen test data. The trained classification model is supposed to capture significant function-related patterns from the training data, which contains proteins with known functions, and to predict the functions of unknown proteins. Hence, unlike homology-based transfer, there is no need to compare the query sequence with each training sequence.

An effective machine learning model mainly depends on three factors: 1) efficient feature representation, 2) effective algorithm, and 3) adequate training samples. With the growth of next-generation sequencing technologies, abundant protein sequences are available, leading to adequate training samples for building PFP models. Most of the recent research has focused on either efficient feature representation or developing a powerful algorithm. Although an efficient model requires a balance of both factors, the subsequent subsections emphasize each of the two factors in turn and give a detailed survey. The models' performance is measured in terms of F1-score: the harmonic mean of precision and recall, Fmax: the maximum protein-centric F1-score calculated over all thresholds, and the area under the precision-recall curve (AUPR), which summarizes the precision-recall curve.

A. Feature Representation-Based Methods

Protein sequences are composed of various combinations of twenty amino acids, each denoted by a letter. These representations of protein sequences in the form of alphabetical strings cannot be directly utilized by machine learning algorithms without being converted into a digital/numerical form. The numerical representation of the protein sequences is the feature with which a machine learning model is trained. The process of converting protein sequences into their numerical form is termed feature extraction. The majority of models used for PFP are trained by using features generated from protein sequences [11], [24], [25], [26], [27], [28], [29], [30]. This section focuses on the importance of various feature extraction techniques in protein function prediction.

1) Natural Language Processing Based Feature Extractions: Since protein sequences are similar to human languages, but without specific semantics [31], various natural language processing (NLP) techniques have been developed for encoding protein sequences. Table I tabulates the literature on NLP-based feature extraction techniques. Using the NLP-based Word2Vec models [32], Asgari et al. [31] created ProtVec embeddings for protein sequences. ProtVec is an n-dimensional vector comprising an unsupervised, data-driven, distributed representation of protein sequences. In terms of protein family classification, ProtVec produced outstanding results and is applicable to other bioinformatics-led investigations like disordered protein identification, structure prediction, protein-protein interaction prediction, and protein visualization.

Du et al. [33] utilized the NLP-based Word2Vec [32] method to define k-mer word embeddings of protein sequences for predicting functions. Their method includes sequence similarity profiles (SSP) that exploit functional relationships across all levels of similarity. They incorporated PPI network data with a modification of the SSP features. If the PPI data is empty, then the SSP features are used as supplementary features, which improved the performance of the model. In [34], researchers have proposed that simple NLP-based encodings for protein
sequences such as one-hot encoding and random embeddings achieve similar performance to pre-trained embeddings. Wan and Jones [29] proposed FFPred-GAN, which uses generative adversarial networks (GAN) to generate synthetic proteins. This data augmentation technique is capable of learning high-dimensional distributions of features based on protein sequences and improved the predictability of protein functions on various classifiers such as k-nearest neighbour (kNN), support vector machine (SVM) and random forests (RF). Integration of FFPred-GAN with NetGO [35] achieved maximum predictive performance. The augmentation improved prediction accuracy, but the increased sample size due to augmentation led to a relative increase in computational cost. Villegas-Morcillo et al. [36] utilized the abundantly available unannotated proteins for creating the embeddings. They proposed an unsupervised protein embedding technique to create the features from unannotated protein sequences. These features are later fed to a supervised model to predict functions. They reported outstanding performance using a two-layer perceptron model. Following the similar concept of utilizing abundant unannotated protein sequences, Mansoor et al. [37] proposed GOGAN, a generative adversarial network-based model that utilizes unannotated protein sequences to create embeddings and predicts protein functions through the knowledge acquired during embedding creation. Maria et al. [28] proposed transferring the annotations based on the similarity between protein sequence embeddings generated through language models like SeqVec and ProtBert, instead of protein sequence similarity. The idea is based on the notion that embeddings generated through language models can capture relevant information for function prediction. Rather than annotating the unknown proteins with the functions of proteins that have similar sequences, they annotated the unknown proteins with the functions of the proteins whose distance in embedding space is minimum. They achieved promising results on the CAFA3 targets.

TABLE I
LITERATURES ON FEATURE EXTRACTION BASED ON NATURAL LANGUAGE PROCESSING

Another prevalent way of encoding protein sequences is by encoding each amino acid of a protein sequence into an n-dimensional feature vector. So, a protein sequence of length m will be represented as a feature vector of size n × m. This is different from protein sequence encoding methods where a complete sequence is represented as a single n-dimensional feature vector. Amino acid-specific information is crucial for residue-level predictions like protein fold recognition. Many researchers have proposed encodings like one-hot encoding (a vector that is all 0s except a single 1), physicochemical properties-based encodings, encodings based on the position-specific scoring matrix, evolution-based encoding, machine learning-based encoding [38], etc. Existing surveys [11], [12] provide a comprehensive review and assessment of these encoding schemes and can be consulted for further understanding of amino acid encoding.

2) Sequence Segmentation Based Feature Extraction: Since protein sequences are of varying lengths, ranging from 30 to 30000, several sequence segmentation techniques are being developed to represent varying-length protein sequences as equal-sized feature vectors. Ko et al. [39] proposed a deep learning program called “FUTUSA” (Function Teller Using Sequence Alone) that applies sequence segmentation to extract features using CNN. They divided the protein sequences into variable-sized segments and trained the model. Segmentation of sequences substantially improved the predictive performance and can distinguish functionally essential amino acids from non-essential ones, which can help in predicting the impact of point mutations in disease progression. Though their work is focused on specific target functions and binary classification, one of its major advantages is that it can detect functional motifs and can predict mutation sites too.

Ranjan et al. [40] proposed a sequence segmentation technique, ProtVecGen, to represent a protein sequence in the form of features. The segmentation approach increased the visibility of conserved regions by reducing the dominance of fillers or non-conserved regions, improving the prediction accuracies for long protein sequences. Another feature set, ProtVecGen-Plus, was created by combining features from different sizes of segments (multi-sized segments), leading to more discriminative features and improved overall prediction accuracy.

In another work, Ranjan et al. [41] proposed ProtVecGen-ensemble, an ensemble-based approach that combines three models trained on datasets segmented in three different ways. They explored variants of tf-idf (namely log-normalized-tf-idf and length-normalized-tf-idf) to conquer
problems like over-normalization, n-mer localization, and small pattern regions in protein sequences. They showed that their ensemble framework achieved overall superior performance and consistent results for varying-length sequences.

Ranjan et al. [42] developed a multi-attention multi-aspect model for sub-sequence classification and protein function prediction. They showed that the single attention mechanism of NLP does not work well for protein sequences, as the semantics of protein words are not as clear as those of natural languages. They showed that a multi-attention model could associate highly significant protein words in a sub-sequence with different functional domains. They also showed the importance of label relevance in PFP using sub-sequences, as a sub-sequence can be more relevant to one label/function than to other labels. They showed that their proposed model gave an outstanding performance on CAFA3 datasets. Table II highlights a few works based on sequence segmentation techniques.

TABLE II
LITERATURES ON FEATURE EXTRACTION BASED ON SEGMENTATION APPROACH

3) Integrated Feature Extraction: Integration of features from various proteomic data sources is also being researched [43], [44], [45], [46], [47], [48]. In [43], the authors utilized heterogeneous protein networks for extracting features. A multi-modal deep autoencoder-based model was used, in which separate layers were used for different networks and later combined into one bottleneck layer. In this approach, heterogeneous protein interaction networks are integrated into a low-dimensional feature representation that is common to all networks. Cai et al. [49] proposed SDN2GO, an integrated deep learning model which takes features from protein sequences, protein domains, and PPI network data to predict protein functions. Three independent neural network models are trained to learn features from protein sequences, protein domains, and PPI networks respectively. They used a CNN along with an embedding layer for the sequence sub-model and the domain sub-model, and a neural network for the PPI network sub-model. The outputs of these three sub-models are concatenated and used as an input to a non-fully connected weighted neural network to predict protein functions. The models were trained on a newly constructed dataset based on the time delay principle of CAFA. They showed a huge impact of protein domain knowledge on improving the predictability of protein functions. Fan et al. [50] proposed Graph2GO, a multi-modal graph-based architecture using a feedforward neural network. They integrated various types of protein data such as protein domains, protein sequence information, protein subcellular location, and protein network data to predict functions, which enhanced the performance of the model. In [47], Zhang et al. proposed DeepFunc, a deep learning framework that predicts protein functions from protein sequences and PPI network data. They utilized the Deepwalk algorithm [51] to extract topological features from PPI data and integrated them with the features extracted from protein sequences to generate a set of high-quality features that are fed to a deep neural network.

Agrawal et al. [52] proposed a hybrid feature vector that is an ensemble of shallow and deep features to train a deep learning framework to predict protein functions. Shallow features are generated using pseudo-amino acid composition and pseudo-position-specific scoring matrix. Deep features are generated using a one-hot encoding and a convolutional neural network.

Giri et al. [48] proposed MultiPredGO, a deep learning-based multi-modal protein function prediction technique in which the researchers amalgamated protein sequences, structure, and protein interaction network. Two independent CNN-based models were developed to extract features from sequences and structures respectively. For interaction information, a knowledge embedding of size 256 is generated. All these extracted features were used to train a hierarchical classification model that can predict protein functions. They showed that the integration of 3D structure data enhanced the performance of the model. They used a voxel representation of 3D data to be fed to ResNet-50 to extract features. These informative representations gave their model better learning, which enhanced the overall performance of the system.

You et al. [53] proposed DeepText2GO, which integrates sequence homology-based information and text-based information to predict protein functions through a consensus approach. For sequence homology-based information, they used BLAST-kNN to get a similarity score. For text-based information, DeepText2GO incorporates NLP's bag-of-words (tf-idf) and deep semantic representations (word2vec, doc2vec) of text from MEDLINE citation abstracts. They used a consensus approach that integrates the prediction scores from multiple methods to improve the predictive performance. Their experiments showed that DeepText2GO significantly outperformed both the text-based and sequence homology-based methods. You et al. [2] proposed a learning to rank (LTR) framework
GOLabeler that uses sequence homology and protein family information from BLAST and InterProScan respectively. GOLabeler first ranks the prospective GO terms for a given query protein and then returns the top-ranked GO terms as the predicted terms. It was the best performing model in CAFA3. You et al. [35] proposed another learning to rank framework, NetGO, which is an updated version of GOLabeler that integrates protein network information from the STRING database along with various types of information from protein sequences to predict all possible GO terms, resulting in improved performance. In the updated version, NetGO 2.0, Yao et al. [46] integrated literature information using logistic regression and more sequence information using RNN, resulting in further improvement in the performance of the model. Table III shows the gist of the literature on integrated features-based predictions.

TABLE III
LITERATURES BASED ON INTEGRATED FEATURE EXTRACTION

B. Learning Algorithm-Based Methods

In recent years, deep-learning techniques have gained popularity in the field of protein function prediction, since they can learn from a wide range of data [53], [54], [55], [56], [57], [58], [59]. In addition to feature extraction techniques, many other deep-learning techniques, including recurrent neural networks (RNNs), convolutional neural networks (CNNs), deep neural networks (DNNs), autoencoders, etc., have been proposed [35], [49], [58], [60], [61], [62], [63], [64]. Table IV summarizes the literature focussing on the learning algorithms to develop a better PFP system. In one such study [65], researchers proposed ProLanGO, in which protein function prediction is seen as a language translation problem, and built a neural machine translation model using a recurrent neural network where a protein sequence was considered one language, and the corresponding functions were considered another language. They evaluated their model on CAFA3 datasets and achieved an AUC of 0.333. Kulmanov et al. [10] were among the first to propose a multi-modal CNN-based model to predict protein functions. They used protein sequences and protein-protein interaction data to train the model. The model learns features from protein sequences using a CNN and an embedding layer. These features were further concatenated with the knowledge embeddings of the protein-protein interaction data to be fed to a classifier. As a classifier, they used a hierarchical classification layer based on the structure of each ontology that refines features. They achieved an F1-score of 0.395 for BP, 0.47 for MF, and 0.633 for CC on data from the Swiss-Prot and STRING databases. In addition to being complex and resource-intensive, their work had other significant limitations, such as prediction of functions only for protein sequences of length less than 1002 and the use of interaction networks that are unavailable for many proteins.

Later, Kulmanov and Hoehndorf [61] proposed DeepGOPlus, which is an extended and improved DeepGO method. They used a 1-D CNN over protein sequences with multiple filters of variable size and showed that the CNN filters learned patterns similar to domain motifs. The CNN's predictions are further integrated with sequence similarity to capture meaningful pieces of information, which led to more accurate predictions. In DeepGOPlus, they removed the hierarchical classification layer, which made the system faster than their previous work DeepGO. DeepGOWeb [66] is a semantic web-based service for DeepGOPlus that can be accessed through the website, a REST API, and a SPARQL query. For an input protein sequence, DeepGOWeb outputs the predicted functions and the corresponding confidence scores.

The applicability of deep neural networks in PFP is highlighted by many researchers, as PFP is structured as a multi-label classification problem [1], [67]. In [68], the authors investigated the usefulness of multi-task deep neural networks (MTDNN) consisting of shared layers upon which task-specific layers are stacked in parallel. This architecture can learn the shared feature representations from all GO terms as well as specific
characteristics from individual GO terms, giving it an edge over other baseline methods, especially for orphan proteins.

TABLE IV
RELEVANT LITERATURES FOCUSSING ON LEARNING ALGORITHMS FOR PFP

Ioannidis et al. [63] proposed a graph neural network (GNN) based approach wherein multi-relational protein-protein network data is used to predict protein functions. The graph network data consists of nodes representing proteins, edges representing relations, and multiple edges between a pair of nodes representing multiple relations; each node is associated with a label representing protein functions. Only a subset of nodes is associated with labels; hence, the problem at hand is posed as a semi-supervised learning task. A GNN-based architecture is developed that weighs the different relations through learnable parameters and predicts the functions. Lai et al. [59] proposed GAT-GO, a graph attention network-based approach that utilizes protein sequence embeddings and predicted inter-residue structure information to predict protein functions. The inclusion of predicted structure information significantly improved their model’s performance. Gligorijevic et al. [6] proposed DeepFRI, another work based on a graph convolution network that utilizes pre-trained sequence embeddings and 3D structures to predict functions. Since the amount of sequence data is huge as compared to structure data, they augmented the experimental structures with homology models to perform the experiments. DeepFRI was experimentally shown to be robust to modeling errors and outperformed various state-of-the-art models. PANDA2 [69], proposed by Zhao et al., utilizes graph neural networks to model the hierarchical relationships of GO and integrates various sequence features like DIAMOND scores, PSI-BLAST scores, priority scores, pseudo-amino acid composition, and language model-based features to predict protein functions.

Many variations of transformer-based architectures have been proposed by researchers to predict protein functions [70], [71], [72]. Cao et al. [70] proposed TALE, a transformer-based protein function annotation tool that is trained on joint embeddings of protein sequences and protein function labels. The model generalizes to novel sequences and unseen labels. To achieve this, it utilizes the self-attention mechanism along with the protein function label embeddings, which highlight global patterns in the sequences and the sequence-function relationship. Qiu et al. [71] proposed TALE-cmap, an extension of the TALE-based architecture that integrates protein structure contact maps along with sequence and label embeddings. TALE-cmap improves the performance of TALE, showing the importance of structures in the prediction of protein functions. Brandes et al. [72] proposed ProteinBERT, a self-supervised and transfer learning architecture: a language model for protein sequences that includes GO predictions. ProteinBERT was self-supervised, as the model was trained to take corrupted protein sequences and GO annotations as inputs and to recover uncorrupted data as output. Its global attention layer enables it to process protein sequences of any length, providing great scalability and efficiency to the model. ProteinBERT is initially trained for GO annotation prediction; however, it is shown that the model can be fine-tuned to perform a variety of tasks like secondary structure prediction, disorder identification, remote homology modeling, etc.

Dhanuka et al. [73] proposed a semi-supervised autoencoders-based deep learning technique to predict protein functions. The autoencoders were trained using protein sequences in a semi-supervised manner by preserving the information about the function for which each was trained. The proposed model was trained in two stages: a) Stage-1 and b) Stage-2. As part of stage-1, autoencoders were trained using protein sequences so that each autoencoder was trained only on the protein sequences corresponding to a particular function. Training of these

autoencoders is not necessary for any additional new function, and the trained autoencoders can be used by any classifier. In the event of any additional function, only a new autoencoder needs to be trained. Stage-2 used the reconstruction losses calculated through the pre-trained autoencoders to classify proteins into their respective protein functions. Each protein sequence was passed through the pre-trained autoencoders, and the reconstruction loss from every autoencoder was used to train a dense sigmoid-layer classifier, with the reconstruction losses as input and the corresponding protein functions as output. Various experiments were conducted with different protein data, including sequences, protein interaction data, and features based on physicochemical properties of amino acids, and it was found that diversified information about proteins helps the model learn better. The proposed model worked best when integrated features were used, with Fmax = 0.4216 and AUPR = 0.40 for BP and Fmax = 0.4749 and AUPR = 0.43 for MF. In addition, experiments with standalone pre-trained autoencoders showed that individual autoencoders can predict the functions for which they were trained, making them scalable to any number of functions while minimizing the effort of retraining the model. Also utilizing autoencoders, Miranda and Hu [74] proposed a deep stacked denoising autoencoder-based model to extract features from protein sequences. They used multi-label SVMs as classifiers on the Yeast [75] and Genbase [76] datasets.

Many machine learning models, such as support vector machines, hidden Markov models, and decision trees, are being used to predict protein functions. A review of these models is given by Tiwari and Srivastava [77].

IV. HYBRID METHODS

Homology-based methods and model-based methods can be used together to get the best of each. Many researchers have proposed hybrid approaches that utilize the results of homology-based methods to improve the predictive performance of the model-based approach. In one such work, a rule-based system was proposed to identify whether a sequence retrieved using the homology-based method is truly homologous, ruling out the uncertainties in the results produced by homology-based methods [23]. In another work [33], sequence similarity profiles are used as a feature to predict protein functions.

V. PROTEIN FUNCTION INTER-RELATIONSHIPS

A further aspect to which the PFP research community is paying attention is the interrelationship of functions [5], [78], [79], [80], [81], [82], [83]. Several studies have focused on the hierarchical structure of the gene ontology (GO) in protein function prediction. Kulmanov et al. [10] used a hierarchical classification layer in their proposed model to imitate the hierarchical structure of GO. The inclusion of a hierarchical classification layer adds complexity to the model in terms of training time, runtime, and the number of trainable parameters. Another approach for utilizing the interrelationship of functions is to transform the GO terms into computational forms that can be further analyzed to predict protein functions. Makrodimitris et al. [84] used a Label-Space Dimensionality Reduction technique to transform the GO terms (with redundancy) into a compact latent representation. In another work, Ahmed et al. [85] used the number of overlapping proteins and interactions across protein clusters to determine the interrelationship of functions. They performed their experiments on the yeast proteome and improved the accuracy of predicting protein functions. Incorporating inter-relationships of functions into a model reduces the model’s complexity by reducing the redundancy among functions, which in turn reduces ambiguity in model training.

Based on the similar concept of reducing the model’s ambiguity due to redundant functions, Dhanuka and Singh [67] proposed a multi-layer perceptron (MLP) based model which incorporates protein function inter-relationships to find the functions performed by protein sequences. They proposed approaches to remove redundant/similar functions along with a way to recover predictions for the removed functions. Two statistical measures, Pearson’s correlation coefficient and the Jaccard similarity coefficient, were used to define inter-functional relationships, and redundant (closely inter-related) functions were removed from the set of functions. An MLP model with a single hidden layer is trained on the reduced set of functions. For recreating the removed functions, two approaches, 1) Direct Mapping and 2) an Ensemble approach, were proposed so that the system could be evaluated fairly, since a model tends to perform well with fewer functions. Direct Mapping populates the prediction value for redundant functions directly from the values of similar functions. Hence, in this case, the prediction probability for a protein is the same for all functions that are similar. The Ensemble approach uses an ensemble of MLPs to automate the mapping of prediction probabilities. Several experiments were conducted and the results were analyzed from various perspectives. It was found that the MLP model with multi-modal protein data as input and the Jaccard similarity coefficient, combined with the Ensemble approach, outperformed other models and was able to predict specific functions too. The lack of adequate experimental annotations for these specific functions makes them harder to predict for many computational models. Seyyedsalehi et al. [86] extracted semantic relationships between GO terms using a generative adversarial network-based deep learning model, PFP-WGAN. Their experiments showed that incorporating semantic and architectural similarities between GO terms improves the predictability of deeper, more specific GO terms.

Apart from GO hierarchy-based similarities, the co-occurrence of GO terms has been analyzed to improve performance in PFP. Li et al. [87] proposed DeepPFP-CO, in which a graph convolutional network (GCN) is used to explore the impact of co-occurrence of GO terms on predicting the functions. Sequential features are extracted using a CNN and Bi-LSTM network and concatenated with PPI network and subsequence-based features, which are then fed to the GCN layer to predict protein functions. In a comparative study against a model that does not consider the co-occurrence of GO terms, performance degradation was observed, highlighting the importance of co-occurrence of GO terms in PFP.
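As a rough illustration of the redundancy-removal idea discussed in this section (Jaccard similarity between functions followed by Direct Mapping), the following Python sketch may help. It is not the implementation from [67]: the greedy grouping strategy, the 0.75 threshold, the function names (jaccard, reduce_functions, direct_mapping), and the toy annotation data are all assumptions made purely for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reduce_functions(annotations, threshold=0.75):
    """Greedily drop functions that closely overlap an already kept function.

    annotations maps a function id to the set of protein ids annotated with it.
    Returns (kept, mapping), where mapping[removed] is the kept function that
    represents the removed one. The threshold is an illustrative assumption.
    """
    kept, mapping = [], {}
    for func in sorted(annotations):  # fixed order keeps the result deterministic
        for rep in kept:
            if jaccard(annotations[func], annotations[rep]) >= threshold:
                mapping[func] = rep  # func is treated as redundant with rep
                break
        else:
            kept.append(func)
    return kept, mapping


def direct_mapping(scores, mapping):
    """Direct Mapping: each removed function inherits its representative's score."""
    full = dict(scores)
    for removed, rep in mapping.items():
        full[removed] = scores[rep]
    return full


# Hypothetical toy annotations (function id -> annotated protein ids).
annotations = {
    "GO:A": {"P1", "P2", "P3", "P4"},
    "GO:B": {"P1", "P2", "P3", "P4", "P5"},  # Jaccard with GO:A is 4/5
    "GO:C": {"P6", "P7"},
}
kept, mapping = reduce_functions(annotations)

# Scores a model trained only on the kept functions might output for one protein.
reduced_scores = {"GO:A": 0.9, "GO:C": 0.1}
full = direct_mapping(reduced_scores, mapping)
print(kept, mapping, full)
```

In this toy case GO:B is merged into GO:A, so a classifier only needs outputs for GO:A and GO:C, and GO:B receives the same probability as GO:A at prediction time. The Ensemble approach described above would replace the plain copy in direct_mapping with learned mappings.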


VI. IMPORTANCE OF INTERPRETABILITY OF MODELS

All the works discussed above have been driven by the same goal of maximizing the predictive accuracy of the protein function prediction system. Although these methods are very effective in terms of predictive performance, they are unable to give biologists meaningful biological insights about the data and the predictions. Understandability of the predictions made by the models helps in building biologists’ confidence in the predictions. If black-box predictions are accompanied by some form of interpretation of the predictions and the data, then the predictions will carry more meaning and weight for biologists, leading them to invest more time and money in the corresponding wet-laboratory investigations. Interpretable models can also give biologists new insights about the protein and the corresponding predictions [88].

In [89], the authors addressed the importance of validation and interpretation of the predicted functions rather than focusing only on the predictive accuracy of the system. That work focuses on discovering comprehensible protein prediction models that are accurate and interpretable, increasing the confidence of biologists in the predicted data, which leads to new insights and hypotheses. A critical review of different data representations used to discover comprehensible protein prediction models is also presented.

Interpretation of models is also important for finding and detecting possible errors in the data or the predictions. Potential errors in the system can arise from using sequences with any GO annotations (without considering evidence codes), incomplete protein sequences, imbalanced data in terms of annotations and sequence length, feature selection, and the choice of algorithms. A few successful works [90], [91] have been proposed on the post-processing of the output from predictive models. They are all rule-based systems (systems based on human-made rules to store and manipulate the data) and are mainly focused on specific domains and functions. Extending a rule-based system to post-process large-scale predictions is still a challenging problem.

VII. CONCLUSION

We presented a comprehensive review of the latest research in the field of PFP using deep learning techniques. We highlighted the drawbacks of homology-based methods, which are still the most reliable tools for PFP. Machine learning techniques are now widely available, and there are many reasons to utilize them to build a better-performing protein function prediction system. We categorized machine learning research into feature-based techniques and algorithm-based techniques according to the focus of each work. A few works have utilized the advantages of both homology-based and model-based methods, and we highlighted the importance of such hybrid models too. Lastly, we discussed the importance of model interpretability, which is missing across these works and has received the least attention. Black-box training of machine learning algorithms is widely used due to its ability to learn from a wide range of data. Such models are capable of learning abstract patterns that other approaches miss. However, their inability to explain the results and their predictions opens up a large space for research in this direction, where proper meaning and analysis can be associated with improving protein function predictions.

REFERENCES

[1] H. Wang, L. Yan, H. Huang, and C. Ding, “From protein sequence to protein function via multi-label linear discriminant analysis,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 14, no. 3, pp. 503–513, May/Jun. 2017.
[2] R. You, Z. Zhang, Y. Xiong, F. Sun, H. Mamitsuka, and S. Zhu, “GOLabeler: Improving sequence-based large-scale protein function prediction by learning to rank,” Bioinformatics, vol. 34, no. 14, pp. 2465–2473, 2018.
[3] R. Sharan, I. Ulitsky, and R. Shamir, “Network-based prediction of protein function,” Mol. Syst. Biol., vol. 3, no. 1, 2007, Art. no. 88.
[4] S. Moosavi, M. Rahgozar, and A. Rahimi, “Protein function prediction using neighbor relativity in protein–protein interaction network,” Comput. Biol. Chem., vol. 43, pp. 11–16, 2013.
[5] P. Sun et al., “Protein function prediction using function associations in protein–protein interaction network,” IEEE Access, vol. 6, pp. 30892–30902, 2018.
[6] V. Gligorijević et al., “Structure-based protein function prediction using graph convolutional networks,” Nat. Commun., vol. 12, no. 1, pp. 1–14, 2021.
[7] A. Roy, J. Yang, and Y. Zhang, “COFACTOR: An accurate comparative algorithm for structure-based protein function annotation,” Nucleic Acids Res., vol. 40, no. W1, pp. W471–W477, 2012.
[8] M. Li, X. Wu, J. Wang, and Y. Pan, “Towards the identification of protein complexes and functional modules by integrating PPI network and gene expression data,” BMC Bioinf., vol. 13, no. 1, pp. 1–15, 2012.
[9] R. You, S. Yao, H. Mamitsuka, and S. Zhu, “DeepGraphGO: Graph neural network for large-scale, multispecies protein function prediction,” Bioinformatics, vol. 37, no. Supplement_1, pp. i262–i271, 2021.
[10] M. Kulmanov, M. A. Khan, and R. Hoehndorf, “DeepGO: Predicting protein functions from sequence and interactions using a deep ontology-aware classifier,” Bioinformatics, vol. 34, no. 4, pp. 660–668, 2017.
[11] X. Jing, Q. Dong, D. Hong, and R. Lu, “Amino acid encoding methods for protein sequences: A comprehensive review and assessment,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 17, no. 6, pp. 1918–1931, Nov./Dec. 2020.
[12] N. Yu, Z. Li, and Z. Yu, “Survey on encoding schemes for genomic data representation and feature learning—From signal processing to machine learning,” Big Data Mining Analytics, vol. 1, no. 3, pp. 191–210, 2018.
[13] Z. Lv, C. Ao, and Q. Zou, “Protein function prediction: From traditional classifier to deep learning,” Proteomics, vol. 19, no. 14, 2019, Art. no. 1900119.
[14] R. Bonetta and G. Valentino, “Machine learning techniques for protein function prediction,” Proteins Struct., Function, Bioinf., vol. 88, no. 3, pp. 397–413, 2020.
[15] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” J. Mol. Biol., vol. 215, no. 3, pp. 403–410, 1990.
[16] S. F. Altschul et al., “Gapped BLAST and PSI-BLAST: A new generation of protein database search programs,” Nucleic Acids Res., vol. 25, no. 17, pp. 3389–3402, 1997.
[17] J. A. Gerlt and P. C. Babbitt, “Can sequence determine function?,” Genome Biol., vol. 1, no. 5, pp. 1–10, 2000.
[18] I. Friedberg, “Automated protein function prediction—the genomic challenge,” Brief. Bioinf., vol. 7, no. 3, pp. 225–242, 2006.
[19] B. Rost, “Enzyme function less conserved than anticipated,” J. Mol. Biol., vol. 318, no. 2, pp. 595–608, 2002.
[20] B. Rost, J. Liu, R. Nair, K. O. Wrzeszczynski, and Y. Ofran, “Automatic prediction of protein function,” Cellular Mol. Life Sci., vol. 60, no. 12, pp. 2637–2650, 2003.
[21] C. E. Jones, A. L. Brown, and U. Baumann, “Estimating the annotation error rate of curated GO database sequence annotations,” BMC Bioinf., vol. 8, no. 1, pp. 1–9, 2007.
[22] S. Kaul et al., “Analysis of the genome sequence of the flowering plant Arabidopsis thaliana,” Nature, vol. 408, no. 6814, pp. 796–815, 2000.
[23] A. Karwath and R. D. King, “Homology induction: The use of machine learning to improve sequence similarity searches,” BMC Bioinf., vol. 3, no. 1, pp. 1–13, 2002.


[24] L. Wang, Y. Wang, and Q. Chang, “Feature selection methods for big data bioinformatics: A survey from the search perspective,” Methods, vol. 111, pp. 21–31, 2016.
[25] W. Li, B. Ma, and K. Zhang, “Optimizing spaced k-mer neighbors for efficient filtration in protein similarity search,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 11, no. 2, pp. 398–406, 2014.
[26] D. Cozzetto, F. Minneci, H. Currant, and D. T. Jones, “FFPred 3: Feature-based function prediction for all gene ontology domains,” Sci. Rep., vol. 6, no. 1, pp. 1–11, 2016.
[27] K.-L. Lin et al., “Feature selection and combination criteria for improving accuracy in protein structure prediction,” IEEE Trans. Nanobiosci., vol. 6, no. 2, pp. 186–196, 2007.
[28] M. Littmann, M. Heinzinger, C. Dallago, T. Olenyi, and B. Rost, “Embeddings from deep learning transfer GO annotations beyond homology,” Sci. Rep., vol. 11, no. 1, pp. 1–14, 2021.
[29] C. Wan and D. T. Jones, “Protein function prediction is improved by creating synthetic feature samples with generative adversarial networks,” Nat. Mach. Intell., vol. 2, no. 9, pp. 540–550, 2020.
[30] H.-J. Yu and D.-S. Huang, “Normalized feature vectors: A novel alignment-free sequence comparison method based on the numbers of adjacent amino acids,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 10, no. 2, pp. 457–467, 2013.
[31] E. Asgari and M. R. Mofrad, “Continuous distributed representation of biological sequences for deep proteomics and genomics,” PLoS One, vol. 10, no. 11, 2015, Art. no. e0141287.
[32] Y. Goldberg and O. Levy, “Word2vec explained: Deriving Mikolov et al.’s negative-sampling word-embedding method,” 2014, arXiv:1402.3722.
[33] Z. Du, Y. He, J. Li, and V. N. Uversky, “DeepAdd: Protein function prediction from k-mer embedding and additional features,” Comput. Biol. Chem., vol. 89, 2020, Art. no. 107379.
[34] T. Lu, A. X. Lu, and A. M. Moses, “Random embeddings and linear regression can predict protein function,” 2021, arXiv:2104.14661.
[35] R. You et al., “NetGO: Improving large-scale protein function prediction with massive network information,” Nucleic Acids Res., vol. 47, no. W1, pp. W379–W387, 2019.
[36] A. Villegas-Morcillo, S. Makrodimitris, R. C. van Ham, A. M. Gomez, V. Sanchez, and M. J. Reinders, “Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function,” Bioinformatics, vol. 37, no. 2, pp. 162–170, 2021.
[37] M. Mansoor, M. Nauman, H. Ur Rehman, and A. Benso, “Gene ontology GAN (GOGAN): A novel architecture for protein function prediction,” Soft Comput., vol. 25, pp. 7653–7667, 2022.
[38] J. Hong et al., “Protein functional annotation of simultaneously improved stability, accuracy and false discovery rate achieved by a sequence-based deep learning,” Brief. Bioinf., vol. 21, no. 4, pp. 1437–1447, 2020.
[39] C. W. Ko, J. Huh, and J. W. Park, “Deep learning program to predict protein functions based on sequence information,” MethodsX, vol. 9, 2022, Art. no. 101622.
[40] A. Ranjan, M. S. Fahad, D. Fernández-Baca, A. Deepak, and S. Tripathi, “Deep robust framework for protein function prediction using variable-length protein sequences,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 17, no. 5, pp. 1648–1659, Sep./Oct. 2020.
[41] A. Ranjan, D. Fernandez-Baca, S. Tripathi, and A. Deepak, “An ensemble Tf-Idf based approach to protein function prediction via sequence segmentation,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 19, no. 5, pp. 2685–2696, Sep./Oct. 2022.
[42] A. Ranjan, A. Tiwari, and A. Deepak, “A sub-sequence based approach to protein function prediction via multi-attention based multi-aspect network,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 20, no. 1, pp. 94–105, Jan./Feb. 2023.
[43] V. Gligorijević, M. Barot, and R. Bonneau, “DeepNF: Deep network fusion for protein function prediction,” Bioinformatics, vol. 34, no. 22, pp. 3873–3881, 2018.
[44] F. Z. Smaili et al., “QAUST: Protein function prediction using structure similarity, protein interaction, and functional motifs,” Genomic., Proteomic. Bioinf., vol. 19, pp. 998–1011, 2021.
[45] M. Barot, V. Gligorijević, K. Cho, and R. Bonneau, “NetQuilt: Deep multispecies network-based protein function prediction using homology-informed network similarity,” Bioinformatics, vol. 37, no. 16, pp. 2414–2422, 2021.
[46] S. Yao, R. You, S. Wang, Y. Xiong, X. Huang, and S. Zhu, “NetGO 2.0: Improving large-scale protein function prediction with massive sequence, text, domain, family and network information,” Nucleic Acids Res., vol. 49, no. W1, pp. W469–W475, 2021.
[47] F. Zhang, H. Song, M. Zeng, Y. Li, L. Kurgan, and M. Li, “DeepFunc: A deep learning framework for accurate prediction of protein functions from protein sequences and interactions,” Proteomics, vol. 19, no. 12, 2019, Art. no. 1900019.
[48] S. J. Giri, P. Dutta, P. Halani, and S. Saha, “MultiPredGO: Deep multi-modal protein function prediction by amalgamating protein structure, sequence, and interaction information,” IEEE J. Biomed. Health Inform., vol. 25, no. 5, pp. 1832–1838, May 2021.
[49] Y. Cai, J. Wang, and L. Deng, “SDN2GO: An integrated deep learning model for protein function prediction,” Front. Bioeng. Biotechnol., vol. 8, 2020, Art. no. 391.
[50] K. Fan, Y. Guan, and Y. Zhang, “Graph2GO: A multi-modal attributed network embedding method for inferring protein functions,” GigaScience, vol. 9, no. 8, 2020, Art. no. giaa081.
[51] B. Perozzi, R. Al-Rfou, and S. Skiena, “DeepWalk: Online learning of social representations,” in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2014, pp. 701–710.
[52] S. Agrawal, D. S. Sisodia, and N. K. Nagwani, “Long short term memory based functional characterization model for unknown protein sequences using ensemble of shallow and deep features,” Neural Comput. Appl., vol. 34, no. 6, pp. 4831–4841, 2022.
[53] R. You, X. Huang, and S. Zhu, “DeepText2GO: Improving large-scale protein function prediction with deep semantic text representation,” Methods, vol. 145, pp. 82–90, 2018.
[54] J. Hou, R. Cao, and J. Cheng, “Deep convolutional neural networks for predicting the quality of single protein structural models,” bioRxiv, 2019, Art. no. 590620, doi: 10.1101/590620.
[55] S. K. Sønderby, C. K. Sønderby, H. Nielsen, and O. Winther, “Convolutional LSTM networks for subcellular localization of proteins,” in Proc. Int. Conf. Algorithms Comput. Biol., Springer, 2015, pp. 68–80.
[56] Z. Zhang et al., “Deep learning in omics: A survey and guideline,” Brief. Funct. Genomic., vol. 18, no. 1, pp. 41–57, 2018.
[57] T. Sun, B. Zhou, L. Lai, and J. Pei, “Sequence-based prediction of protein protein interaction using a deep-learning algorithm,” BMC Bioinf., vol. 18, no. 1, 2017, Art. no. 277.
[58] Z. Guo, J. Hou, and J. Cheng, “DNSS2: Improved ab initio protein secondary structure prediction using advanced deep learning architectures,” Proteins Struct. Function Bioinf., vol. 89, no. 2, pp. 207–217, 2021.
[59] B. Lai and J. Xu, “Accurate protein function prediction via graph attention networks with predicted structure information,” Brief. Bioinf., vol. 23, no. 1, 2022, Art. no. bbab502.
[60] A. S. Rifaioglu, T. Doğan, M. J. Martin, R. Cetin-Atalay, and V. Atalay, “DEEPred: Automated protein function prediction with multi-task feed-forward deep neural networks,” Sci. Rep., vol. 9, no. 1, pp. 1–16, 2019.
[61] M. Kulmanov and R. Hoehndorf, “DeepGOPlus: Improved protein function prediction from sequence,” Bioinformatics, vol. 36, no. 2, pp. 422–429, 2020.
[62] M. Tanveer et al., “Machine learning techniques for the diagnosis of Alzheimer’s disease: A review,” ACM Trans. Multimedia Comput. Commun. Appl., vol. 16, no. 1s, pp. 1–35, 2020.
[63] V. N. Ioannidis, A. G. Marques, and G. B. Giannakis, “Graph neural networks for predicting protein functions,” in Proc. IEEE 8th Int. Workshop Comput. Adv. Multi-Sensor Adaptive Process., 2019, pp. 221–225.
[64] M. E. Elhaj-Abdou, H. El-Dib, A. El-Helw, and M. El-Habrouk, “Deep_CNN_LSTM_GO: Protein function prediction from amino-acid sequences,” Comput. Biol. Chem., vol. 95, 2021, Art. no. 107584.
[65] R. Cao, C. Freitas, L. Chan, M. Sun, H. Jiang, and Z. Chen, “ProLanGO: Protein function prediction using neural machine translation based on a recurrent neural network,” Molecules, vol. 22, no. 10, 2017, Art. no. 1732.
[66] M. Kulmanov, F. Zhapa-Camacho, and R. Hoehndorf, “DeepGOWeb: Fast and accurate protein function prediction on the (semantic) web,” Nucleic Acids Res., vol. 49, no. W1, pp. W140–W146, 2021.
[67] R. Dhanuka and J. P. Singh, “Protein function prediction using functional inter-relationship,” Comput. Biol. Chem., vol. 95, 2021, Art. no. 107593.
[68] R. Fa, D. Cozzetto, C. Wan, and D. T. Jones, “Predicting human protein function with multi-task deep neural networks,” PLoS One, vol. 13, no. 6, 2018, Art. no. e0198216.
[69] C. Zhao, T. Liu, and Z. Wang, “PANDA2: Protein function prediction using graph neural networks,” NAR Genomic. Bioinf., vol. 4, no. 1, 2022, Art. no. lqac004.
[70] Y. Cao and Y. Shen, “TALE: Transformer-based protein function annotation with joint sequence–label embedding,” Bioinformatics, vol. 37, no. 18, pp. 2825–2833, 2021.


[71] X.-Y. Qiu, H. Wu, and J. Shao, “TALE-cmap: Protein function prediction based on a TALE-based architecture and the structure information from contact map,” Comput. Biol. Med., vol. 149, 2022, Art. no. 105938.
[72] N. Brandes, D. Ofer, Y. Peleg, N. Rappoport, and M. Linial, “ProteinBERT: A universal deep-learning model of protein sequence and function,” Bioinformatics, vol. 38, no. 8, pp. 2102–2110, 2022.
[73] R. Dhanuka, A. Tripathi, and J. P. Singh, “A semi-supervised autoencoder-based approach for protein function prediction,” IEEE J. Biomed. Health Inform., vol. 26, no. 10, pp. 4957–4965, Oct. 2022.
[74] L. J. Miranda and J. Hu, “A deep learning approach based on stacked denoising autoencoders for protein function prediction,” in Proc. IEEE 42nd Annu. Comput. Softw. Appl. Conf., 2018, pp. 480–485.
[75] A. Elisseeff and J. Weston, “A kernel method for multi-labelled classification,” Adv. Neural Inf. Process. Syst., vol. 14, pp. 681–687, 2001.
[76] S. Diplaris, G. Tsoumakas, P. A. Mitkas, and I. Vlahavas, “Protein classification with multiple algorithms,” in Proc. Panhellenic Conf. Inform., Springer, 2005, pp. 448–456.
[77] A. K. Tiwari and R. Srivastava, “A survey of computational intelligence techniques in protein function prediction,” Int. J. Proteomic., vol. 2014, 2014, Art. no. 845479.
[78] G. Pandey, C. L. Myers, and V. Kumar, “Incorporating functional inter-relationships into protein function prediction algorithms,” BMC Bioinf., vol. 10, no. 1, 2009, Art. no. 142.
[79] G. Yu, Y. Zhao, C. Lu, and J. Wang, “HashGO: Hashing gene ontology for protein function prediction,” Comput. Biol. Chem., vol. 71, pp. 264–273, 2017.
[80] Z. Barutcuoglu, R. E. Schapire, and O. G. Troyanskaya, “Hierarchical multi-label prediction of gene function,” Bioinformatics, vol. 22, no. 7, pp. 830–836, 2006.
[81] R. Eisner, B. Poulin, D. Szafron, P. Lu, and R. Greiner, “Improving protein function prediction using the hierarchical structure of the gene ontology,” in Proc. IEEE Symp. Comput. Intell. Bioinf. Comput. Biol., 2005, pp. 1–10.
[82] M. Masseroli, D. Chicco, and P. Pinoli, “Probabilistic latent semantic analysis for prediction of gene ontology annotations,” in Proc. IEEE Int. Joint Conf. Neural Netw., 2012, pp. 1–8.
[83] J. Meng, J.-S. Wekesa, G.-L. Shi, and Y.-S. Luan, “Protein function prediction based on data fusion and functional interrelationship,” Math. Biosci., vol. 274, pp. 25–32, 2016.
[84] S. Makrodimitris, R. C. van Ham, and M. J. Reinders, “Improving protein function prediction using protein sequence and GO-term similarities,” Bioinformatics, vol. 35, no. 7, pp. 1116–1124, 2019.
[85] K. S. Ahmed, N. H. Soloma, and Y. M. Kadah, “Exploring protein functions correlation based on overlapping proteins and cluster interactions,” in Proc. IEEE 1st Middle East Conf. Biomed. Eng., 2011, pp. 247–251.
[86] S. F. Seyyedsalehi, M. Soleymani, H. R. Rabiee, and M. R. Mofrad, “PFP-WGAN: Protein function prediction by discovering gene ontology term correlations with generative adversarial networks,” PLoS One, vol. 16, no. 2, 2021, Art. no. e0244430.
[87] M. Li, W. Shi, F. Zhang, M. Zeng, and Y. Li, “A deep learning framework for predicting protein functions with co-occurrence of GO terms,” IEEE/ACM Trans. Comput. Biol. Bioinf., early access, Apr. 27, 2022, doi: 10.1109/TCBB.2022.3170719.
[88] D. Szafron et al., “Proteome analyst: Custom predictions with explanations in a web-based tool for high-throughput proteome annotations,” Nucleic Acids Res., vol. 32, no. suppl_2, pp. W365–W371, 2004.
[89] A. A. Freitas, D. C. Wieser, and R. Apweiler, “On the importance of comprehensible classification models for protein function prediction,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 7, no. 1, pp. 172–182, First Quarter 2010.
[90] A. Clare, A. Karwath, H. Ougham, and R. D. King, “Functional bioinformatics for Arabidopsis thaliana,” Bioinformatics, vol. 22, no. 9, pp. 1130–1136, 2006.
[91] G. L. Pappa, A. J. Baines, and A. A. Freitas, “Predicting post-synaptic activity in proteins with data mining,” Bioinformatics, vol. 21, no. suppl_2, pp. ii19–ii25, 2005.

Richa Dhanuka received the BTech degree in information technology from VIT University, Vellore, India, in 2011. She is currently working toward the MTech and PhD degrees with the Department of Computer Science and Engineering, National Institute of Technology Patna, India. She has six years of work experience in the software industry. Her current research interests include machine learning and its applicability across domains, bioinformatics, and artificial intelligence.

Jyoti Prakash Singh (Senior Member, IEEE) received the PhD degree from the University of Calcutta. He is an assistant professor with the Department of Computer Science and Engineering, National Institute of Technology Patna, India. He has coauthored seven textbooks and one edited book. He has published more than 50 international journal publications with leading publishers and more than 55 international conference proceedings. He was involved as an investigator in the MeitY-sponsored project to develop algorithms for spam calls/fake calls in a telephonic conversation. His research interests focus on social media mining, deep learning, information security, and speech processing. He is an associate editor of the International Journal of Electronic Government Research. He was awarded the S4DS Data Scientist (Academia) Award by the Society for Data Science in 2020.

Anushree Tripathi (Member, IEEE) is working as faculty with the Department of Computer Science and Engineering, National Institute of Technology, Patna (NITP), India. She has postdoctoral experience from the Indian Institute of Science (IISc) Bangalore. She has published 16 research papers, presented her research work at about 10 international and national conferences, and visited the USA and Italy a number of times to deliver lectures and participate in scientific discussions. She has been awarded the National Academy of Sciences India, NASI (oldest science academy of India) Swarna Jayanti Award. She has been conferred a travel award by the International Conference on Intelligent Biology and Medicine (ICIBM2018), USA.
