Genomics xxx (xxxx) xxx–xxx
Contents lists available at ScienceDirect
Genomics
journal homepage: www.elsevier.com/locate/ygeno
Compositional dynamics and codon usage pattern of BRCA1 gene across
nine mammalian species
Supriyo Chakraborty a,⁎, Tarikul Huda Mazumder a, Arif Uddin b,⁎
a Department of Biotechnology, Assam University, Silchar 788011, Assam, India
b Department of Zoology, Moinul Hoque Choudhury Memorial Science College, Algapur, Hailakandi 788150, Assam, India
ARTICLE INFO
Keywords: Cancer; Breast; Molecular genetics; Codon usage bias
ABSTRACT
The BRCA1 gene is located on human chromosome 17q21.31 and plays an important role in biological processes. The aminoacyl-tRNA synthetases (AARS) are a family of heterogeneous enzymes responsible for protein
synthesis, whose secondary functions include a role in autoimmune myositis. Our findings reveal that
compositional constraint and a preference for A/T-ending codons determine the codon usage pattern in the
BRCA1 gene, while G/C-ending codons influence the codon usage pattern of the AARS gene among mammals.
The codon usage bias in BRCA1 and AARS genes is low. The codon CGC encoding arginine amino acid and the
codon TTA encoding leucine were uniformly distributed in BRCA1 and AARS genes, respectively in mammals
including human. Natural selection might have played a major role while mutation pressure might have played a
minor role in shaping the codon usage pattern of BRCA1 and AARS genes.
1. Introduction
The genetic code is degenerate, meaning that more than one codon encodes the same amino acid. Unequal usage of synonymous codons for
encoding the same amino acid during translation of a gene transcript
into a protein is a well-established phenomenon commonly known as
codon usage bias (CUB). It is species specific and significantly differs
among the genes of the same taxa [3,12,31,35]. The codon usage patterns have been analyzed since the outstanding efforts for the creation
of the first molecular sequence databases were initiated [12]. The result
of Grantham and his co-workers demonstrated that species specific
genes share similar patterns of synonymous codon usage frequency as
stated by the “genome hypothesis” [11,12]. Therefore, scanning the
codon usage patterns of all the genes in an organism may obscure the
underlying heterogeneity [2] and hence it is better to identify the trends
of codon usage patterns within the genes of a species or between closely
related species. The factors responsible for codon usage bias in organisms ranging from prokaryotes to higher eukaryotes have been discussed by many researchers, but the evolution of codon usage patterns within the genes of an organism has so far been explained in varied ways. In general, compositional constraints under mutation pressure or natural selection are considered the major factors involved in codon usage variation among different organisms [8,20,26,48].
The human BRCA1 gene is located on chromosome 17q21.31 and comprises 24 exons; its coding region encodes a protein of 1863 amino acids [33]. The functions of BRCA1 attributed to its tumor activity include cell cycle progression, DNA damage repair and regulation of a specific set of pathways, and germ line mutations are known in its sequence. The role of BRCA1 in predisposing affected individuals to breast and ovarian cancer [36] has been discussed earlier, but a comparative analysis of the synonymous codon usage that shapes codon bias in the BRCA1 gene among mammals, with reference to human, has not been done so far.
Housekeeping genes are typically constitutive genes that carry out
the maintenance of basic cellular functions, and are expressed in all
cells of an organism under normal and patho-physiological conditions
[9,22]. The AARS gene encodes the enzyme alanyl-tRNA synthetase, which catalyzes the attachment of the amino acid alanine to its cognate tRNA.
The aminoacyl-tRNA synthetases are a family of heterogeneous enzymes
responsible for protein synthesis, and their secondary functions include a
role in autoimmune myositis [18].
In this study, we analyzed the codon bias and codon context patterns in the coding sequences of BRCA1 and compared them with those of one housekeeping gene (AARS), using sequences of equal length across mammals and codon bias measures such as the effective number of codons (ENC), relative synonymous codon usage (RSCU) and the relative abundance of dinucleotides. Further, in order to understand the extent
of selection pressure acting on the protein coding BRCA1 and AARS
Corresponding author.
E-mail addresses: supriyoch_2008@rediffmail.com (S. Chakraborty), arif.uddin29@gmail.com (A. Uddin).
https://doi.org/10.1016/j.ygeno.2018.01.013
Received 1 September 2017; Received in revised form 22 December 2017; Accepted 22 January 2018
0888-7543/ © 2018 Elsevier Inc. All rights reserved.
Please cite this article as: Chakraborty, S., Genomics (2018), https://doi.org/10.1016/j.ygeno.2018.01.013
Table 1
Coding sequence accession number and length (bp) of BRCA1 and AARS (alanyl-tRNA synthetase) genes across mammals.
Sl. no.  Mammal                  BRCA1 accession no.  BRCA1 length^a (bp)  AARS accession no.  AARS length^a (bp)
1        Homo sapiens            gi|353441748         5589                 NM_001605.2         2904
2        Pan troglodytes         gi|113865840         5589                 XM_016930106.1      2904
3        Pongo pygmaeus abelii   gi|667713698         5589                 NM_001131919.1      2904
4        Nomascus leucogenys     gi|667713708         5589                 –                   –
5        Nomascus gabriellae     gi|667713690         5589                 –                   –
6        Macaca fascicularis     gi|672890339         5589                 XM_005592535.2      2904
7        Macaca mulatta          gi|169234601         5589                 XM_015126512.1      2904
8        Papio anubis            gi|692110329         5589                 XM_003918580.4      2904
9        Miopithecus talapoin    gi|667713714         5589                 –                   –
a Coding sequence length excludes the stop codon.
Table 2
Nucleotide composition (%) at three codon positions and AT-GC contents (%) of synonymous codons in the coding sequences of BRCA1 and AARS genes across mammals.
No.   A      T      G      C      A3     T3     G3     C3     AT     GC     AT3    GC3
BRCA1 gene
1     34.7   24.1   21.8   19.4   31.4   34.1   17.7   16.8   58.8   41.2   65.5   34.5
2     34.8   24.0   21.7   19.5   31.5   34.0   17.6   16.9   58.8   41.2   65.5   34.5
3     34.9   24.0   21.6   19.5   31.6   33.8   17.7   16.9   58.9   41.1   65.4   34.6
4     34.8   24.0   21.6   19.6   31.4   33.7   17.8   17.1   58.8   41.2   65.2   34.8
5     34.8   24.1   21.6   19.5   31.3   33.8   17.8   17.1   58.8   41.2   65.1   34.9
6     34.9   24.2   21.5   19.4   31.5   34.3   17.7   16.5   59.1   40.9   65.8   34.2
7     34.9   24.2   21.4   19.5   31.6   34.2   17.5   16.7   59.1   40.9   65.8   34.2
8     34.9   24.3   21.5   19.3   31.7   34.2   17.6   16.5   59.2   40.8   65.9   34.1
9     34.8   24.2   21.6   19.4   31.7   34.2   17.7   16.4   59.0   41.0   65.9   34.1
M     34.83  24.12  21.59  19.46  31.52  34.0   17.68  16.77  58.94  41.05  65.57  34.43
SD    0.071  0.109  0.117  0.088  0.139  0.218  0.097  0.259  0.159  0.159  0.300  0.300
AARS gene
1     25.0   21.3   28.6   25.1   15.3   24.6   28.7   31.4   46.4   53.6   39.9   60.1
2     25.1   21.3   28.5   25.1   15.3   24.6   28.6   31.5   46.4   53.6   39.9   60.1
3     25.2   21.0   28.3   25.5   15.8   23.9   28.3   32.0   46.2   53.8   39.7   60.3
4     25.2   20.9   28.5   25.4   15.8   23.6   29.0   31.6   46.1   53.9   39.3   60.7
5     25.3   21.0   28.5   25.2   16.1   23.8   28.6   31.5   46.2   53.8   39.9   60.1
6     25.3   20.8   28.5   25.4   15.9   23.2   28.9   32.0   46.1   53.9   39.1   60.9
M     25.18  21.05  28.48  25.28  15.70  23.95  28.68  31.67  46.23  53.76  39.63  60.37
SD    0.117  0.208  0.098  0.172  0.329  0.558  0.248  0.266  0.137  0.137  0.350  0.350
M: mean; SD: standard deviation.
Table 3
Correlation coefficients among nucleotide compositions at three codon positions in the coding sequences of BRCA1 gene (below the diagonal) and for AARS gene (above the diagonal in blue) across mammals.
      A1       A2       A3       T1       T2       T3       C1       C2       C3       G1       G2       G3
A1    0        -0.71    0.79     -0.71    0.76     -0.96**  0.89*    0.18     0.66     -0.94**  0.72     0.67
A2    0.27     0        -0.82*   0.54     -0.70    0.62     -0.71    -0.10    -0.18    0.61     -0.94**  -0.41
A3    -0.03    -0.27    0        -0.80    0.46     -0.84*   0.95**   0.52     0.63     -0.88*   0.70     0.16
T1    0.45     0.04     -0.48    0        -0.31    0.77     -0.90*   -0.54    -0.74    0.83*    -0.32    -0.04
T2    0.53     -0.48    0.58     0.22     0        -0.67    0.48     -0.47    0.29     -0.51    0.79     0.81*
T3    0.15     -0.35    0.60     -0.51    0.33     0        -0.93**  -0.27    -0.83*   0.97**   -0.60    -0.50
C1    -0.29    0.04     0.57     -0.49    0.22     -0.10    0        0.53     0.75     -0.97**  0.60     0.28
C2    0.67*    -0.30    0.02     0.10     0.53     0.50     -0.40    0        0.39     -0.48    -0.07    -0.44
C3    -0.02    0.54     -0.76*   0.56     0.49     -0.93**  -0.16    0.41     0        -0.80    0.09     0.09
G1    -0.71*   -0.23    -0.10    -0.57    -0.64    0.31     -0.25    -0.18    -0.23    0        -0.56    -0.40
G2    -0.93**  0.15     -0.18    -0.22    -0.64    0.31     0.12     -0.80*   0.22     0.64     0        0.62
G3    -0.20    -0.03    -0.82**  0.45     -0.35    -0.78*   -0.17    -0.18    0.72*    -0.04    0.35     0
* p < 0.05, ** p < 0.01.
Fig. 1. Heat map representation of the correlation coefficient between codon usage and GC3s in the coding sequences of BRCA1 and AARS genes among mammals.
genes, we measured the ratio of the number of nonsynonymous substitutions per nonsynonymous site to the number of synonymous substitutions per synonymous site (dN/dS ratio) between human and closely related mammals. The present study provides insight into the patterns of codon usage, offering clues for codon optimization to alter translational efficiency and for the functional conservation of gene expression, and it highlights the significance of nucleotide composition in BRCA1 and the housekeeping gene (AARS) within mammals.
2. Methodology
2.1. Sequence data
Nucleotide coding sequences (CDS) of the BRCA1 gene for nine mammals (n = 9) were retrieved, with accession numbers (Table 1), from the GenBank database of the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov). Only CDS of equal length that had proper start and stop codons, were devoid of unknown bases (N) and were an exact multiple of three bases were used. In addition, the CDS of a housekeeping gene, alanyl-tRNA synthetase (AARS), were retrieved for six mammals (n = 6), with accession numbers (Table 1) and equal length, for comparative analysis with BRCA1.
2.2. Nucleotide composition analysis
The overall contents (%) of the nucleotides A, T, G and C, the frequencies (%) of the four nucleotides at the third codon position (A3, T3, G3, C3), and the GC and AT contents (%) at different positions of synonymous codons were estimated in the coding sequences of BRCA1 and AARS genes across the selected mammals in order to examine the extent of base compositional bias.
2.3. Effective number of codons
The observed effective number of codons (ENC) for each coding sequence of BRCA1 and AARS gene was calculated using the formula given by Wright [47] as follows:
$$\mathrm{ENC} = 2 + \frac{9}{F_2} + \frac{1}{F_3} + \frac{5}{F_4} + \frac{3}{F_6}$$
where F_k (k = 2, 3, 4 or 6) is the average of the F values for k-fold degenerate amino acids. The F value denotes the probability that two randomly chosen codons for an amino acid are identical.
The ENC value generally ranges from 20 to 61. A low ENC value (< 35) indicates high codon usage bias and a higher ENC value reveals low codon usage bias in the gene [47]. The ENC value thus shows an inverse relationship with the degree of codon bias.
2.4. Relative synonymous codon usage
The relative synonymous codon usage (RSCU) values of different codons in the coding sequences of BRCA1 and AARS genes were calculated as per Comeron and Aguade [7] using the formula:
$$\mathrm{RSCU} = \frac{g_{ij}}{\sum_{j}^{n_i} g_{ij}} \, n_i$$
where g_ij is the relative codon usage frequency of the ith codon for the jth amino acid, which is encoded by n_i synonymous codons [7]. In our analysis, an RSCU value > 1.0 represents positive codon usage bias while an RSCU value < 1.0 indicates negative codon usage bias for the corresponding amino acid. Moreover, synonymous codons with RSCU value > 1.6 are considered over-represented and those with RSCU < 0.6 under-represented [46].
2.5. Relative dinucleotide abundance
The relative abundance of the sixteen dinucleotides in the coding sequences of BRCA1 and AARS genes across mammals was determined using the approach of Chiusano et al. [6]. The odds ratio of each dinucleotide was computed using the formula:
$$P_{xy} = \frac{f_{xy}}{f_x f_y}$$
where f_x and f_y denote the frequencies of the nucleotides X and Y, respectively, and f_xy denotes the frequency of the dinucleotide XY. In our analysis, a dinucleotide with P_xy > 1.23 was considered over-represented and one with P_xy < 0.78 under-represented in terms of relative abundance.
2.6. Analysis of selection pressures on the coding sequence of BRCA1 and AARS genes
The number of nonsynonymous substitutions per nonsynonymous site (dN), the number of synonymous substitutions per synonymous site (dS) and the ratio of dN to dS (dN/dS) were estimated as per Nielsen and Yang [30] to assess the effect of natural selection that acted on BRCA1 and AARS genes during the course of evolution.
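As an illustration of the indices defined in Sections 2.4 and 2.5, the following Python sketch computes RSCU values and dinucleotide odds ratios for a single coding sequence. It is not the authors' Perl program: the function names `rscu` and `dinucleotide_odds` are hypothetical, the standard genetic code is assumed, and RSCU is computed from raw codon counts within each synonymous family.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption of this sketch).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for the 59 codons with synonymous alternatives."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    # Count sense codons only; stop codons and ambiguous triplets are ignored.
    counts = Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if len(family) == 1 or total == 0:
            continue  # Met and Trp have no synonyms; absent amino acids are skipped
        for c in family:
            # observed count divided by the count expected under equal usage
            values[c] = counts[c] * len(family) / total
    return values


def dinucleotide_odds(seq):
    """Return {XY: f_xy / (f_x * f_y)} for the sixteen dinucleotides."""
    seq = seq.upper()
    n = len(seq)
    base_freq = {b: seq.count(b) / n for b in "ACGT"}
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    odds = {}
    for x, y in product("ACGT", repeat=2):
        expected = base_freq[x] * base_freq[y]
        if expected > 0:
            odds[x + y] = (pair_counts[x + y] / (n - 1)) / expected
    return odds
```

With the thresholds stated above, a codon would be flagged as over-represented when `rscu(cds)[codon] > 1.6`, and a dinucleotide when `dinucleotide_odds(cds)[xy] > 1.23`.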
influence of amino acid composition. Each CDS was represented as a
59-dimensional vector, in which each dimension corresponded to the RSCU
value of one sense codon; the codons ATG (methionine) and TGG
(tryptophan) and the three stop codons were excluded. The major trends in codon usage
variation can be determined from the relative inertia of the axes, along which
the coding sequences are ordered to investigate the major factors affecting the codon usage pattern. COA was done using XLSTAT Pro
software.
Table 4
Overall relative synonymous codon usage (RSCU) values in the coding sequences of BRCA1 and AARS genes among mammals.
AA    Codon   BRCA1 N   BRCA1 RSCU^a   AARS N   AARS RSCU^a
Ala   GCA**   305       1.63           98       0.73
      GCC*    139       0.75           228      1.74
      GCG     28        0.15           19       0.14
      GCT*    280       1.49           181      1.37
Arg   CGT     43        0.38           8        0.16
      CGC     18        0.17           40       0.80
      CGA*    42        0.37           64       1.28
      CGG*    28        0.26           96       1.94
      AGA**   317       2.80           35       0.70
      AGG**   233       2.05           55       1.10
Asn   AAC*    355       0.65           117      1.15
      AAT*    733       1.35           87       0.85
Asp   GAT*    523       1.37           199      0.88
      GAC*    237       0.63           252      1.12
Cys   TGC*    95        0.48           39       1.05
      TGT*    297       1.52           35       0.95
Gln   CAA     359       0.80           31       0.25
      CAG*    539       1.20           220      1.75
Glu   GAA*    1194      1.36           154      0.80
      GAG*    561       0.64           232      1.20
Gly   GGA*    267       1.43           95       0.81
      GGC*    140       0.74           172      1.45
      GGG     143       0.76           106      0.91
      GGT*    200       1.07           97       0.83
His   CAT*    313       1.36           29       0.57
      CAC*    149       0.64           73       1.43
Ile   ATA*    252       1.08           10       0.10
      ATC*    182       0.78           150      1.50
      ATT*    270       1.16           140      1.40
Leu   TTA*    299       1.27           24       0.24
      TTG*    291       1.23           69       0.75
      CTA     214       0.90           28       0.29
      CTC*    137       0.58           115      1.24
      CTG*    260       1.11           270      2.91
      CTT     211       0.90           49       0.54
Lys   AAA*    710       1.17           140      0.71
      AAG     507       0.83           256      1.29
Phe   TTT*    279       1.27           129      1.16
      TTC     160       0.74           92       0.84
Pro   CCA*    335       1.60           47       0.87
      CCC     127       0.60           88       1.61
      CCG     5         0.02           12       0.21
      CCT**   371       1.77           72       1.31
Ser   TCA     336       0.97           33       0.66
      TCC     159       0.45           67       1.36
      TCG     10        0.01           10       0.20
      TCT**   596       1.75           70       1.43
      AGC*    384       1.13           86       1.77
      AGT**   571       1.67           28       0.56
Thr   ACA*    340       1.36           88       1.11
      ACC     209       0.84           70       0.89
      ACG     36        0.15           54       0.68
      ACT**   417       1.67           103      1.31
Tyr   TAC     127       0.90           81       1.04
      TAT*    154       1.10           75       0.96
Val   GTA     206       0.88           35       0.35
      GTC     133       0.57           111      1.09
      GTG*    262       1.12           213      2.11
      GTT*    332       1.43           44       0.44
2.8. Neutrality plot
Mutations at the 3rd position of synonymous codons mostly result in synonymous changes, whereas mutations at the 1st and 2nd codon positions lead to nonsynonymous (amino acid changing) substitutions. Nonsynonymous mutations occur less frequently since they may affect gene functionality. Theoretically, mutations should occur randomly at the three codon positions in a DNA molecule if there is no external pressure, whereas under selection pressure the preference of bases at the three codon positions would not be the same [37]. A neutrality plot, i.e. a plot of GC12 against GC3, depicts the roles of directional mutational pressure and natural selection. In this plot, the regression coefficient of GC12 on GC3 measures the equilibrium between mutation and selection [37].
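The neutrality plot described above can be sketched in a few lines of Python. The helper names `gc_by_position` and `regression_slope` are hypothetical, and ordinary least squares is assumed for the regression of GC12 on GC3; the original analysis may have used different software.

```python
def gc_by_position(cds):
    """Return (GC12, GC3) in percent for one coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    gc = [0, 0, 0]
    for i in range(usable):
        if cds[i] in "GC":
            gc[i % 3] += 1
    n_codons = usable // 3
    gc1, gc2, gc3 = (100.0 * g / n_codons for g in gc)
    return (gc1 + gc2) / 2.0, gc3


def regression_slope(x, y):
    """Least-squares slope of y on x (here: GC12 on GC3)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx
```

Under this reading, the slope multiplied by 100 gives the relative neutrality (%), and 100 minus that value gives the relative constraint on GC3.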
2.9. Software used
A Perl program was developed to estimate the codon usage bias
indices and the selection pressure on the coding sequences of BRCA1
and AARS genes. Statistical analysis was carried out using IBM SPSS
version 21.0 and the heat map (cluster analysis) was generated with
NetWalker software version 1.0 [17]. Phylogenetic analysis based on
nonsynonymous substitution was performed with MEGA 6.0 software
[39].
2.10. Correlation analysis
The correlation coefficient between any two parameters was estimated
by Karl Pearson's product moment method to assess the presence and
the degree of relationship between the parameters. The significance of
the correlation coefficient was tested with a t-test with (n − 2) degrees of
freedom at p < 0.01 or p < 0.05.
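The test described above converts a correlation coefficient r into the statistic t = r·sqrt((n − 2)/(1 − r²)) and compares it with the critical t value for n − 2 degrees of freedom. The sketch below is a minimal pure-Python illustration; the function names are hypothetical, and the two-tailed critical values are copied from standard t tables only for the two sample sizes used in this study (n = 9 and n = 6).

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Two-tailed critical t values keyed by degrees of freedom, then by alpha.
T_CRITICAL = {7: {0.05: 2.365, 0.01: 3.499}, 4: {0.05: 2.776, 0.01: 4.604}}


def is_significant(r, n, alpha):
    """True if |t| exceeds the tabulated critical value for df = n - 2."""
    return abs(t_statistic(r, n)) > T_CRITICAL[n - 2][alpha]
```

For example, a coefficient of 0.853 from nine sequences gives t of about 4.32, which exceeds the critical value at p < 0.01 for 7 degrees of freedom.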
2.11. Skewness analysis of nucleotides
Skewness of any two nucleotides (x, y) was estimated as (x − y)/(x + y) to understand the compositional dynamics of the coding sequences. A positive skew value between nucleotides x and y indicates the preponderance of x over y, while a negative value indicates that x is less abundant than y in the coding
sequence. A skew value deviating from zero indicates
unequal usage of the two nucleotides in the transcript.
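The skew formula above translates directly into code. In this sketch the function name `skew` is hypothetical; each argument may be a single base or a group of bases, so the same function covers GC and AT skew as well as purine/pyrimidine (AG versus CT) and keto/amino (GT versus AC) skew.

```python
def skew(seq, x, y):
    """Return (x - y) / (x + y), where x and y are bases or groups of bases."""
    seq = seq.upper()
    nx = sum(seq.count(b) for b in x)
    ny = sum(seq.count(b) for b in y)
    if nx + ny == 0:
        raise ValueError("neither base group occurs in the sequence")
    return (nx - ny) / (nx + ny)
```

For instance, `skew(cds, "G", "C")` gives the GC skew and `skew(cds, "AG", "CT")` the purine skew of a coding sequence `cds`.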
3. Results and discussion
3.1. Nucleotide compositions in BRCA1 and AARS genes across mammals
Footnote to Table 4: a, mean values of RSCU based on the synonymous codon usage frequency; AA, amino acid; N, total number of codons; *RSCU > 1.0; **RSCU > 1.6.
Nucleotide compositions in the coding sequences of BRCA1 and
AARS genes among the selected mammals were analyzed (Table 2). In the BRCA1 gene, the overall AT content (58.94%) was higher than the GC content (41.05%), whereas in the AARS gene the overall GC content (53.76%) was higher than the AT content (46.23%). It is well known that the nucleotide at the third codon
position varies considerably, as explained by the wobble hypothesis, which allows the
cell to read all 61 sense codons on the mRNA with a limited number of tRNA
molecules. In our analysis, we observed that the bases T and C were the
2.7. Correspondence analysis
Correspondence analysis is generally used to investigate the major
trend in codon usage variation among genes [34,44]. To explore the
variation in codon usage in BRCA1 and AARS genes among nine
mammalian species, the RSCU values of codons for all CDS selected in
this study were used for correspondence analysis to reduce the
Fig. 2. Graphical representation of nucleotide skewness values for BRCA1 and AARS genes among mammals.
Fig. 3. Correspondence analysis of RSCU values in BRCA1 and AARS genes. Each point in the plot represents the distribution of a gene corresponding to the coordinates of the primary
and secondary axes of variation. Black color indicates the codons while blue color indicates different species. (For interpretation of the references to color in this figure legend, the reader
is referred to the web version of this article.)
Fig. 4. Neutrality plot analysis of GC12 versus GC3 in BRCA1 and AARS genes.
Fig. 5. Line diagram showing the relative abundance of 16 dinucleotides in BRCA1 and AARS genes.
Fig. 6. Heat map representation of amino acid usage in BRCA1 and AARS across mammals. Each rectangular box with color bar represents the occurrence of amino acid frequency (shown
in columns) corresponds to [A] BRCA1 gene and [B] AARS gene across mammals (shown in rows). (For interpretation of the references to color in this figure legend, the reader is referred
to the web version of this article.)
most frequently used ones at the 3rd position of codons in the coding sequences of BRCA1 and AARS genes, respectively. Moreover, in the BRCA1 gene, A1 showed a significant positive correlation with C2 but a significant negative correlation with G1 and G2. Both A3 and T3 were significantly negatively correlated with C3 and with G3 (p < 0.05 or p < 0.01). In addition, C3 had a significant positive correlation with G3 in the CDS of the BRCA1 gene (Table 3) [49]. In contrast, in the CDS of the AARS gene, C1 had a significant negative correlation (p < 0.01) with G1, while T3 was strongly negatively correlated with C1 but positively correlated with G1. Similarly, A1 showed a significant negative correlation with T3 and G1 (Table 3) [49]. Similar to the BRCA1 gene, the members of the albumin superfamily also exhibited low GC content (< 44.63%) [25]. Uddin and Chakraborty also reported low GC content in mitochondrial CYB genes [42]. Similar to the AARS gene, the overall GC content was higher in the GATA2 gene across mammals [24].
3.2. Codon usage bias of BRCA1 and AARS genes
The observed ENC value in the coding sequences of BRCA1 and
AARS genes among mammals ranged from 49 to 51, indicating low bias
in synonymous codon usage; that is, the synonymous
codons were used almost equally for the corresponding amino acid in
both BRCA1 and AARS genes of mammals [49]. The low codon usage
bias of a gene might be helpful for efficient replication in vertebrates,
whose different cell types have different codon preferences [14,43].
Moreover, low codon bias of a gene indicates the presence of greater
genetic variability for synonymous codon usage in the gene. High codon
bias arises in a gene when one or a few codons within a synonymous
family are preferred in the mRNA.
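The ENC values discussed here follow Wright's formula given in Section 2.3. The Python sketch below shows one way to compute it for a single coding sequence; it is not the authors' program. The function name `enc` is hypothetical, the standard genetic code is assumed, the per-amino-acid homozygosity is taken as F = (n·Σp² − 1)/(n − 1), amino acids with fewer than two codons observed or with F ≤ 0 are skipped, and the result is capped at 61. A sequence lacking an entire degeneracy class raises an error instead of applying Wright's correction for missing classes.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption of this sketch).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def enc(cds):
    """Effective number of codons: 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6, capped at 61."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    f_by_class = {2: [], 3: [], 4: [], 6: []}
    for family in families.values():
        k = len(family)                       # degeneracy of this amino acid
        n = sum(counts[c] for c in family)    # codons observed for it
        if k == 1 or n < 2:
            continue
        sum_p2 = sum((counts[c] / n) ** 2 for c in family)
        f = (n * sum_p2 - 1) / (n - 1)        # sample-corrected homozygosity
        if f > 0:
            f_by_class[k].append(f)
    weights = {2: 9, 3: 1, 4: 5, 6: 3}        # number of amino acids per class
    total = 2.0                               # Met and Trp contribute one codon each
    for k, w in weights.items():
        if not f_by_class[k]:
            raise ValueError("no usable amino acid in the %d-fold class" % k)
        total += w / (sum(f_by_class[k]) / len(f_by_class[k]))
    return min(total, 61.0)
```

The two extremes behave as expected: a sequence that uses a single codon for every amino acid returns 20, and one that uses all synonymous codons equally returns 61.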
Fig. 7. Variation of synonymous and nonsynonymous mutation for amino acid in BRCA1 at different positions across nine mammals.
Relatively weak codon bias exists in the coding sequences of BRCA1 and AARS genes across mammals, as reflected by high ENC values (49–51). Similarly, the mean ENC value of the GATA2 gene was 41.60 ± 7.33, suggesting the existence of relatively weak codon bias across mammals [24]. The ENC value ranged from 51.65 to 56.62 in albumin superfamily genes in human. The overall ENC value of these genes was > 50, which suggested that the synonymous codons were used equally in albumin superfamily genes and hence showed less codon usage bias [25]. The ENC values for the CYB gene in different species of pisces, aves and mammals were 58.33, 59.66 and 58.33, respectively [43].
3.3. Pattern of codon usage
To elucidate the relationship of the codon usage variation with GC constraints among the selected coding sequences of BRCA1 and AARS genes, we analyzed the correlation coefficients of codon usage with GC3 using a heat map (Fig. 1). We observed that some of the codons displayed a positive correlation while others showed a negative correlation with GC3. The codon CGC encoding arginine and the codon TTA encoding leucine were uniformly distributed in BRCA1 and AARS genes, respectively, in mammals including human. These results suggest that the usage frequency of positively correlated codons will increase with increasing GC bias and that of negatively correlated codons will decrease with increasing GC bias [24,32]. In the GATA2 gene, it was reported that most of the G/C-ending codons in the coding sequences were positively correlated with GC3, indicating that codon usage had been influenced by the GC bias but little by the A/T-ending bases [24].
3.4. Relationship of codon usage bias with compositional properties
We performed correlation analysis between codon usage bias and compositional properties to understand the effect of base composition on codon usage bias. A significant positive correlation (r = 0.853, p < 0.01) between ENC and GC content and a significant negative correlation (r = −0.853, p < 0.01) between ENC and AT content were observed for the BRCA1 gene across mammals. However, for the AARS gene, ENC and GC content were negatively correlated whereas ENC and AT content were positively correlated. The ENC value was negatively correlated with A3 and T3 but positively correlated with G3 and C3 for the BRCA1 gene across mammals, while for the CDS of the AARS gene the ENC value was negatively correlated with A3 and G3 but positively correlated with T3 and C3. These findings suggest that compositional constraints were one of the important factors in determining the codon usage patterns in BRCA1 as well as AARS genes across mammals [5].
3.5. Relative synonymous codon usage (RSCU) of BRCA1 and AARS genes among mammals
The RSCU values of the 59 sense codons, excluding the codons ATG and TGG encoding methionine and tryptophan, respectively, and the three stop codons, were analyzed in the coding sequences of BRCA1 and AARS genes among mammals (Table 4). We observed that 29 codons (A–ending 9, T–ending 14, G–ending 5, C–ending 1) were more frequently used (RSCU > 1.0), among which T/A-ending codons were predominant over G/C-ending codons. Among the 29 more frequently used codons, four T-ending codons (CCT, TCT, AGT and ACT), two A-ending codons (GCA, AGA) and one G-ending codon (AGG) were highly used (RSCU > 1.6) in the BRCA1 gene. In contrast, for the AARS gene, 28 codons (A–ending 2, T–ending 6, G–ending 7, C–ending 13) were more frequently used (RSCU > 1.0), among which C–ending codons were predominant over other codons. Furthermore, the codons AGA and AGG encoding arginine in the BRCA1 gene and the codons CTG and GTG, encoding leucine and valine respectively, in the CDS of the AARS gene across mammals had the highest RSCU values, i.e. > 2.0. Our analysis revealed that T-ending codons were mostly favored in the coding sequences of the BRCA1 gene but C–ending codons were mostly favored in the coding sequences of the AARS gene across mammals [42]. For the GATA2 gene, the relative codon usage frequency revealed that C-ending codons were mostly preferred to G-ending codons across mammals. The codon ATT encoding
Fig. 8. Phylogenetic tree based on nonsynonymous (dN) distance using codon alignment [A] BRCA1 gene and [B] AARS gene across mammals. The tree was drawn to infer the
evolutionary history using neighbor-joining method with 1000 bootstrap replicates, conducted in MEGA6.
isoleucine showed an RSCU value of zero, possibly because this codon has been disfavored in the GATA2 gene across mammals [24].
3.6. Relationship between nucleotide skewness and codon usage
It was reported that, due to differential mutational pressure, nucleotide usage varies across genes and differs significantly between exons and introns of human genes [21]. In order to find out the relationship between nucleotide skewness and codon usage bias, we calculated the skew values from the variation in base composition within each coding sequence of BRCA1 and AARS genes. Skewness values for the GC, AT, keto, amino, purine and pyrimidine bases revealed that base composition bias is linked to transcription processes [10,41]. In our analysis, positive skew values were observed (Fig. 2) for GC, AT, GC3, keto, amino, purine and pyrimidine bases [27] in the coding sequence of the BRCA1 gene across mammals. However, in the coding sequence of the AARS gene across mammals, we observed positive skew values for GC and AT bases but negative skew values for keto, amino, purine and pyrimidine bases [43].
Table 5
Comparisons between human and other mammals for the ratio of the number of nonsynonymous substitution per nonsynonymous site to the number of synonymous substitution per synonymous site (dN/dS) in the coding sequences of BRCA1 and AARS genes.
Comparison                               BRCA1 dN/dS   AARS dN/dS
Homo sapiens vs. Pan troglodytes         1.600         0.093
Homo sapiens vs. Pongo pygmaeus abelii   0.667         0.099
Homo sapiens vs. Nomascus leucogenys     0.750         –
Homo sapiens vs. Nomascus gabriellae     0.833         –
Homo sapiens vs. Macaca fascicularis     5.500         0.059
Homo sapiens vs. Macaca mulatta          5.810         0.068
Homo sapiens vs. Papio anubis            5.500         0.069
Homo sapiens vs. Miopithecus talapoin    6.000         –
3.7. Correspondence analysis
To determine the trends in codon usage variation of BRCA1 and AARS genes, we performed correspondence analysis (COA) based on the RSCU values of 59 synonymous codons. In the BRCA1 gene, the first principal axis (f1) accounted for 44.91% of the total variation, whereas the second axis (f2) accounted for 33.37% (Fig. 3). In the AARS gene, the first principal axis (f1) accounted for 62.57% of the total variation, whereas the second axis (f2) accounted for only 23.92% (Fig. 3). In these plots, the positions of the
codons lie close to the axes, indicating that base composition under
mutation bias might be correlated with the codon usage of BRCA1 and AARS
genes [15,45].
dinucleotides were over-represented in BRCA1 and AARS genes across
mammals which might be due to the effect of CpG dinucleotide as reported earlier in different organisms [4].
3.8. Neutrality plot analysis
We performed a neutrality plot analysis of GC12 versus GC3 in order to understand the influence of selection and mutation pressure on the codon usage bias of BRCA1 and AARS genes [37]. In a neutrality plot, when the genes lie along a regression line with a slope of unity, there is a significant correlation between GC12 and GC3, indicating that the genes are under neutral mutation pressure. If a gene is under directional mutation pressure, it falls below the line of unit slope, i.e. closer to the X-axis and farther from the Y-axis. Therefore, a regression line with a slope < 1 indicates that a non-neutral mutation pressure affects the codon usage in the gene within the same genome [28,38]. In our analysis, we observed non-significant (p > 0.05) negative correlations between GC12 and GC3 for BRCA1 (Pearson r = −0.156) and AARS (Pearson r = −0.194) genes. Further, we estimated the magnitude of natural selection and mutation pressure using the regression coefficient. The regression coefficient of GC12 on GC3 in the BRCA1 gene was 0.409, indicating that the relative neutrality was 40.9% while the relative constraint on GC3 was 59.1% (Fig. 4). Similarly, for the AARS gene, the regression coefficient of GC12 on GC3 was 0.054, which indicated that the relative neutrality was 5.4% and the relative constraint was 94.6%. These results suggest that natural selection played a major role while mutation pressure played a minor role in the codon usage bias of BRCA1 and AARS genes across mammals [13,43].
3.10. Amino acid usage and codon bias
The frequency distributions of amino acid usage in BRCA1 and AARS proteins in the selected mammals were grouped using hierarchical clustering with Euclidean distance and represented in a heat map (Fig. 6). The results showed that the amino acids serine (S), glutamate (E), leucine (L), lysine (K), asparagine (N) and, to a lesser extent, threonine (T) and valine (V) were the most over-represented amino acids in BRCA1 protein. Similarly, the amino acids leucine (L), alanine (A), glycine (G), aspartate (D) and valine (V) were the over-represented amino acids in AARS protein. Multiple sequence alignment of the 1863 amino acid residues of BRCA1 protein across mammals was implemented in Mega6 under CodonW alignment (Fig. 7). Our results showed that amino acids at different positions in BRCA1 protein changed radically in human when compared with other mammals during evolution [24]. In the GATA2 gene, four amino acids, namely alanine (A), glycine (G), proline (P) and serine (S), were more widely used and two amino acids, isoleucine (I) and tryptophan (W), were least used [24].
3.11. Non-synonymous substitution in BRCA1 and AARS genes corresponds
to phylogeny of mammals
The coding sequences of BRCA1 and AARS genes from the selected
mammals were aligned with the ClustalW2 program [40]. Phylogenetic trees
based on nonsynonymous substitutions in the complete coding sequences of BRCA1 and AARS genes were constructed (Fig. 8) by the
neighbor-joining method using MEGA6 [39]. The evolutionary distances were computed using the Nei-Gojobori method [29] and are in
units of the number of nonsynonymous substitutions per nonsynonymous site. In our analysis, the trees indicate that closely related
mammals were separated into distinct clades. The bootstrap
values are shown above the branches [24]. In GATA2 gene, the rates of
nonsynonymous substitution were closely similar
in M. musculus and R. norvegicus but distinctly different
from that in H. sapiens [24].
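The distance unit mentioned above (nonsynonymous substitutions per nonsynonymous site) comes from the last step of the Nei-Gojobori method [29], which converts the observed proportion of nonsynonymous differences into a distance with the Jukes-Cantor correction. The sketch below covers only that final step and assumes the counts of differences and sites have already been obtained from a codon alignment; the function names and the example counts are hypothetical.

```python
from math import log


def jukes_cantor(p):
    """Correct an observed proportion of differences p for multiple hits.

    Implements d = -(3/4) * ln(1 - (4/3) * p), valid for 0 <= p < 0.75.
    """
    if not 0 <= p < 0.75:
        raise ValueError("proportion must satisfy 0 <= p < 0.75")
    return -0.75 * log(1 - 4 * p / 3)


def nonsynonymous_distance(n_differences, n_sites):
    """dN from counted nonsynonymous differences and nonsynonymous sites."""
    if n_sites <= 0:
        raise ValueError("number of sites must be positive")
    return jukes_cantor(n_differences / n_sites)


# Hypothetical counts for one pair of sequences, for illustration only.
d_n = nonsynonymous_distance(n_differences=40, n_sites=400)
```

The same correction applied to synonymous differences and synonymous sites gives dS, and a matrix of such pairwise distances is the input to neighbor-joining.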
3.9. Relative abundance of dinucleotide and codon usage patterns
Previous studies suggest that dinucleotide bias can influence the overall
codon usage patterns in a variety of organisms [6,16]. We computed the
relative abundance of the 16 dinucleotides in the coding sequences of
BRCA1 and AARS genes to assess the effect of dinucleotides on
the codon usage patterns of these genes across the selected mammals. Our analysis showed that CpG dinucleotides were under-represented
(mean ± SD = 0.18 ± 0.011 and 0.46 ± 0.025) whereas GpC dinucleotides were over-represented (mean ± SD = 1.04 ± 0.016 and 1.01 ± 0.009) in the CDS of
BRCA1 and AARS genes, respectively. Moreover, the codons (Table 4)
containing CpG dinucleotide (TCG, CCG, ACG, GCG, CGA, CGC, CGG
and CGT) had RSCU values < 1.0, indicating their under-representation
for the corresponding amino acids in BRCA1 gene. A similar result was also
observed in AARS gene, except that the codons CGA and CGG had RSCU
values > 1.0. However, the codons containing GpC dinucleotide (i.e.
TGC, CGC, GGC, GCC, GCG) had RSCU values < 1.0, except AGC, GCA
and GCT, in BRCA1 gene. Again, in AARS gene the codons containing GpC dinucleotide (i.e. TGC, CGC, GGC, GCC, AGC, GCA) had RSCU
values < 1.0 except the codons AGC and GCT. Similarly, four TpG-containing codons (TTG, CTG, TGT, GTG) and
five CpA-containing codons (CCA, CAT, CAG, ACA and GCA) were over-represented in the CDS of BRCA1 gene and most of them were also used as
preferred codons for their corresponding amino acids (Fig. 5). Besides,
the TpG-containing codon CTG and the CpA-containing codon CAG were
over-represented in the CDS of AARS gene. A previous study reported that
CpA and TpG are over-represented in different organisms, which
could be due to the fate of the CpG dinucleotide: spontaneous deamination
(mutation) of methylated cytosine in a CpG dinucleotide yields a thymine (T) residue, forming the dinucleotide TpG on the same strand and
CpA on the opposite strand of DNA after replication [4]. Over-representation of TpG and CpA dinucleotides in BRCA1 and AARS genes
across mammals suggests that mutation through spontaneous
deamination might have affected the codon usage pattern of these genes
during evolution. Our analysis also revealed that CpA and TpG
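
The relative abundance values reported in this section can be illustrated with the odds ratio of Karlin and Burge [16], in which the observed frequency of a dinucleotide is divided by the product of the frequencies of its two bases. The sketch below assumes that this is the measure used here; the function name and the toy sequence are hypothetical. Values below 1 indicate under-representation and values above 1 indicate over-representation.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def dinucleotide_odds_ratios(cds):
    """Return observed/expected ratios for all 16 dinucleotides of a sequence.

    The ratio for dinucleotide xy is f_xy / (f_x * f_y). Characters other than
    A, C, G and T are dropped first, which is a simplification. Dinucleotides
    whose expected frequency is zero are reported as 0.0.
    """
    seq = "".join(b for b in cds.upper() if b in BASES)
    if len(seq) < 2:
        raise ValueError("sequence needs at least two unambiguous bases")
    base_counts = Counter(seq)
    # Overlapping dinucleotides: positions (1,2), (2,3), (3,4), ...
    pair_counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_bases = len(seq)
    n_pairs = len(seq) - 1
    ratios = {}
    for x, y in product(BASES, repeat=2):
        expected = (base_counts[x] / n_bases) * (base_counts[y] / n_bases)
        observed = pair_counts[x + y] / n_pairs
        ratios[x + y] = observed / expected if expected > 0 else 0.0
    return ratios


# Hypothetical toy sequence for illustration, not a real BRCA1 or AARS CDS.
ratios = dinucleotide_odds_ratios("ATGGCCAAGTGACTGCAGTGA")
```

Averaging such ratios over the coding sequences of all species in a data set gives mean ± SD values of the kind quoted for CpG and GpC.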
3.12. Selection pressure acting on the protein-coding BRCA1 and AARS
genes
The ratio of nonsynonymous substitutions per nonsynonymous site to
synonymous substitutions per synonymous site (dN/dS
ratio) is a good indicator of the selective pressure acting on a
protein-coding gene [19]. A dN/dS ratio greater than unity indicates positive or Darwinian selection, in which amino
acid changes in the protein are favored. A ratio less than one indicates purifying selection, in which amino acid changes are suppressed, whereas
a ratio equal to unity points towards neutral evolution [1]. In our
analysis (Table 5), the dN/dS ratio for the coding sequences of BRCA1
gene was greater than one when Homo sapiens was compared with P.
troglodytes, M. fascicularis, M. mulatta, P. anubis and M. talapoin, indicating
positive or Darwinian selection during their evolution. However, a dN/dS
ratio less than one was observed between Homo sapiens and each of P.
pygmaeus, N. leucogenys and N. gabriellae, suggesting that BRCA1 gene has
undergone purifying selection to preserve its protein function in
these organisms [23]. A similar result (dN/dS < 1) was also observed in
the CDS of AARS gene when Homo sapiens was compared with other
mammals such as P. troglodytes, P. pygmaeus abelii, M. fascicularis, M. mulatta and P. anubis. In GATA2 gene, the mean rate of synonymous
substitution per synonymous site (dS) was higher in H. sapiens, M.
musculus, S. scrofa and B. taurus; however, these data showed no statistically significant difference between the groups. The rate of
nonsynonymous substitution per site (dN) for GATA2 gene was also
higher in H. sapiens, M. musculus and B. taurus, but relatively lower in S.
scrofa and R. norvegicus. However, all the nonsynonymous substitutions
between the groups showed strong, statistically significant differences
(p < 0.001) [24].
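
The interpretation rule for the dN/dS ratio used in this section can be written as a small helper. This is a sketch only: the function name, the tolerance used to decide when a ratio counts as equal to one, and the example values are assumptions for illustration and do not reproduce Table 5.

```python
def selection_regime(d_n, d_s, tolerance=1e-9):
    """Classify a pairwise comparison by its dN/dS ratio.

    Returns (omega, label), where label is 'positive' for omega > 1,
    'purifying' for omega < 1 and 'neutral' when omega equals 1 within
    the given tolerance.
    """
    if d_s <= 0:
        raise ValueError("dS must be positive for the ratio to be defined")
    omega = d_n / d_s
    if abs(omega - 1.0) <= tolerance:
        return omega, "neutral"
    return omega, "positive" if omega > 1.0 else "purifying"


# Hypothetical dN and dS values for illustration only.
for d_n, d_s in [(0.02, 0.01), (0.01, 0.02), (0.5, 0.5)]:
    omega, label = selection_regime(d_n, d_s)
    print(f"dN={d_n}, dS={d_s}: dN/dS={omega:.2f} ({label})")
```

In practice a ratio is rarely exactly one, so a statistical test of whether dN differs from dS is usually preferable to a fixed tolerance.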
[20] W.H. Li, Models of nearly neutral mutations with particular implications for nonrandom usage of synonymous codons, J. Mol. Evol. 24 (1987) 337–345.
[21] E. Louie, J. Ott, J. Majewski, Nucleotide frequency variation across human genes,
Genome Res. 13 (2003) 2594–2601.
[22] M. Mahadevappa, J.A. Warrington, Housekeeping Genes, eLS, 2002.
[23] T.H. Mazumder, S. Chakraborty, Gaining insights into the codon usage patterns of
TP53 gene across eight mammalian species, PLoS One 10 (2015) e0121709.
[24] T.H. Mazumder, A. Uddin, S. Chakraborty, Transcription factor gene GATA2: association of leukemia and nonsynonymous to the synonymous substitution rate
across five mammals, Genomics 107 (2016) 155–161.
[25] H. Mirsafian, A. Mat Ripen, A. Singh, P.H. Teo, A.F. Merican, S.B. Mohamad, A
comparative analysis of synonymous codon usage bias pattern in human albumin
superfamily, Sci. World J. 2014 (2014).
[26] R.R. Nair, M.B. Nandhini, T. Sethuraman, G. Doss, Mutational pressure dictates
synonymous codon usage in freshwater unicellular alpha-cyanobacterial descendant Paulinella chromatophora and beta-cyanobacterium Synechococcus elongatus PCC6301, SpringerPlus 2 (2013) 492.
[27] A. Necsulea, J.R. Lobry, A new method for assessing the effect of replication on DNA
base composition asymmetry, Mol. Biol. Evol. 24 (2007) 2169–2179.
[28] A. Necşulea, J.R. Lobry, Revisiting the directional mutation pressure theory: the
analysis of a particular genomic structure in Leishmania major, Gene 385 (2006)
28–40.
[29] M. Nei, T. Gojobori, Simple methods for estimating the numbers of synonymous and
nonsynonymous nucleotide substitutions, Mol. Biol. Evol. 3 (1986) 418–426.
[30] R. Nielsen, Z. Yang, Estimating the distribution of selection coefficients from phylogenetic data with applications to mitochondrial and viral DNA, Mol. Biol. Evol. 20
(2003) 1231–1239.
[31] M. Nirenberg, P. Leder, M. Bernfield, R. Brimacombe, J. Trupin, F. Rottman,
C. O'Neal, RNA codewords and protein synthesis, VII. On the general nature of the
RNA code, Proc. Natl. Acad. Sci. U. S. A. 53 (1965) 1161–1168.
[32] G.A. Palidwor, T.J. Perkins, X. Xia, A general model of codon bias due to GC mutational bias, PLoS One 5 (2010) e13431.
[33] A. Pavlicek, V.N. Noskov, N. Kouprina, J.C. Barrett, J. Jurka, V. Larionov, Evolution
of the tumor suppressor BRCA1 locus in primates: implications for cancer predisposition, Hum. Mol. Genet. 13 (2004) 2737–2751.
[34] G. Perriere, J. Thioulouse, Use and misuse of correspondence analysis in codon
usage studies, Nucleic Acids Res. 30 (2002) 4548–4555.
[35] Y. Prat, M. Fromer, N. Linial, M. Linial, Codon usage is associated with the evolutionary age of genes in metazoan genomes, BMC Evol. Biol. 9 (2009) 285.
[36] E.M. Rosen, S. Fan, R.G. Pestell, I.D. Goldberg, BRCA1 gene in breast cancer, J. Cell.
Physiol. 196 (2003) 19–41.
[37] N. Sueoka, Directional mutation pressure and neutral molecular evolution, Proc.
Natl. Acad. Sci. U. S. A. 85 (1988) 2653–2657.
[38] N. Sueoka, Y. Kawanishi, DNA G+ C content of the third codon position and codon
usage biases of human genes, Gene 261 (2000) 53–62.
[39] K. Tamura, G. Stecher, D. Peterson, A. Filipski, S. Kumar, MEGA6: molecular evolutionary genetics analysis version 6.0, Mol. Biol. Evol. 30 (2013) 2725–2729.
[40] J.D. Thompson, D.G. Higgins, T.J. Gibson, CLUSTAL W: improving the sensitivity of
progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res. 22 (1994)
4673–4680.
[41] M. Touchon, E.P. Rocha, From GC skews to wavelets: a gentle guide to the analysis
of compositional asymmetries in genomic data, Biochimie 90 (2008) 648–659.
[42] A. Uddin, S. Chakraborty, Synonymous codon usage pattern in mitochondrial CYB
gene in pisces, aves, and mammals, Mitochondrial DNA (2015) 1–10.
[43] A. Uddin, S. Chakraborty, Codon usage trend in mitochondrial CYB gene, Gene 586
(2016) 105–114.
[44] H.C. Wang, D.A. Hickey, Rapid divergence of codon usage patterns within the rice
genome, BMC Evol. Biol. 7 (Suppl. 1) (2007) S6.
[45] L. Wei, J. He, X. Jia, Q. Qi, Z. Liang, H. Zheng, Y. Ping, S. Liu, J. Sun, Analysis of
codon usage bias of mitochondrial genome in Bombyx mori and its relation to
evolution, BMC Evol. Biol. 14 (2014) 262.
[46] E.H. Wong, D.K. Smith, R. Rabadan, M. Peiris, L.L. Poon, Codon usage bias and the
evolution of influenza A viruses. Codon usage biases of influenza virus, BMC Evol.
Biol. 10 (2010) 253.
[47] F. Wright, The ‘effective number of codons’ used in a gene, Gene 87 (1990) 23–29.
[48] C. Xu, X. Cai, Q. Chen, H. Zhou, Y. Cai, A. Ben, Factors affecting synonymous codon
usage bias in chloroplast genome of Oncidium Gower Ramsey, Evol. Bioinformatics
Online 7 (2011) 271–278.
[49] Z. Zhang, W. Dai, D. Dai, Synonymous codon usage in TTSuV2: analysis and comparison with TTSuV1, PLoS One 8 (2013) e81469.
Acknowledgements
We are thankful to Assam University, Silchar, Assam, India, for
providing the necessary lab facilities in carrying out this research work.
Ethics statement: Not applicable. The study is based on analysis of
DNA sequences available in public databases accessible to everyone.
Conflict of interest
There is no conflict of interest in this research work.
References
[1] M. Anisimova, D.A. Liberles, The quest for natural selection in the age of comparative genomics, Heredity (Edinb) 99 (2007) 567–579.
[2] S. Aota, T. Gojobori, F. Ishibashi, T. Maruyama, T. Ikemura, Codon usage tabulated
from the GenBank genetic sequence data, Nucleic Acids Res. 16 (Suppl) (1988)
r315–402.
[3] S.K. Behura, D.W. Severson, Comparative analysis of codon usage bias and codon
context patterns between dipteran and hymenopteran sequenced genomes, PLoS
One 7 (2012) e43111.
[4] A.P. Bird, DNA methylation and the frequency of CpG in animal DNA, Nucleic Acids
Res. 8 (1980) 1499–1504.
[5] A.M. Butt, I. Nasrullah, Y. Tong, Genome-wide analysis of codon usage and influencing factors in chikungunya viruses, PLoS One 9 (2014) e90905.
[6] M.L. Chiusano, F. Alvarez-Valin, M. Di Giulio, G. D'Onofrio, G. Ammirato,
G. Colonna, G. Bernardi, Second codon positions of genes and the secondary
structures of proteins. Relationships and implications for the origin of the genetic
code, Gene 261 (2000) 63–69.
[7] J.M. Comeron, M. Aguade, An evaluation of measures of synonymous codon usage
bias, J. Mol. Evol. 47 (1998) 268–274.
[8] L. Duret, D. Mouchiroud, Expression pattern and, surprisingly, gene length shape
codon usage in Caenorhabditis, Drosophila, and Arabidopsis, Proc. Natl. Acad. Sci.
U. S. A. 96 (1999) 4482–4487.
[9] E. Eisenberg, E.Y. Levanon, Human housekeeping genes, revisited, Trends Genet. 29
(2013) 569–574.
[10] S. Fujimori, T. Washio, M. Tomita, GC-compositional strand bias around transcription start sites in plants and fungi, BMC Genomics 6 (2005) 26.
[11] R. Grantham, C. Gautier, M. Gouy, R. Mercier, A. Pave, Codon catalog usage and the
genome hypothesis, Nucleic Acids Res. 8 (1980) r49–r62.
[12] R. Grantham, C. Gautier, M. Gouy, M. Jacobzone, R. Mercier, Codon catalog usage
is a genome strategy modulated for gene expressivity, Nucleic Acids Res. 9 (1981)
r43–74.
[13] B. He, H. Dong, C. Jiang, F. Cao, S. Tao, L.-A. Xu, Analysis of codon usage patterns in
Ginkgo biloba reveals codon usage tendency from A/U-ending to G/C-ending, Sci.
Rep. 6 (2016).
[14] G.M. Jenkins, E.C. Holmes, The extent of codon usage bias in human RNA viruses
and its evolutionary origin, Virus Res. 92 (2003) 1–7.
[15] X. Jia, S. Liu, H. Zheng, B. Li, Q. Qi, L. Wei, T. Zhao, J. He, J. Sun, Non-uniqueness
of factors constraint on the codon usage in Bombyx mori, BMC Genomics 16 (2015)
356.
[16] S. Karlin, C. Burge, Dinucleotide relative abundance extremes: a genomic signature,
Trends Genet. 11 (1995) 283–290.
[17] K. Komurov, S. Dursun, S. Erdin, P.T. Ram, NetWalker: a contextual network analysis tool for functional genomics, BMC Genomics 13 (2012) 282.
[18] M.A. Kron, M. Petridis, M. Haertlein, B. Libranda-Ramirez, L.E. Scaffidi, Do tissue
levels of autoantigenic aminoacyl-tRNA synthetase predict clinical disease? Med.
Hypotheses 65 (2005) 1124–1127.
[19] S. Kryazhimskiy, J.B. Plotkin, The population genetics of dN/dS, PLoS Genet. 4
(2008) e1000304.