ICCCI 2021 Paper 204
1 Introduction
Sentiment Analysis (SA) is a Natural Language Processing (NLP) research field
that focuses on analyzing people’s opinions, sentiments, and emotions. SA
techniques are categorized into symbolic and sub-symbolic approaches. The
former use lexica and ontologies [1] to encode the polarity associated with
words and multiword expressions. The latter consist of supervised,
semi-supervised, and unsupervised machine learning techniques that perform
sentiment classification based on word co-occurrence frequencies. Among all
these techniques, the most popular are based on deep neural networks. Some
hybrid frameworks leverage both symbolic and sub-symbolic approaches.
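To make the symbolic side concrete, the following is a minimal sketch of lexicon-based polarity scoring. The lexicon entries, their scores, and the function name `lexicon_polarity` are hypothetical and chosen only for illustration (English words are used for readability); a real lexicon or ontology would be far larger and language-specific.

```python
# Toy polarity lexicon (hypothetical entries and scores, for illustration only).
LEXICON = {
    "good": 1.0,
    "excellent": 2.0,
    "bad": -1.0,
    "terrible": -2.0,
    "not bad": 0.5,  # a multiword expression with its own polarity
}


def lexicon_polarity(text, lexicon=LEXICON):
    """Sum lexicon scores over the text, preferring two-word expressions."""
    words = text.lower().split()
    score = 0.0
    i = 0
    while i < len(words):
        bigram = " ".join(words[i:i + 2])
        if len(words) - i >= 2 and bigram in lexicon:
            score += lexicon[bigram]  # multiword expression matched
            i += 2
        else:
            score += lexicon.get(words[i], 0.0)  # unknown words count as 0
            i += 1
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_polarity("the food was excellent"))
print(lexicon_polarity("it was not bad"))
```

A sub-symbolic system would instead learn such associations from labeled or unlabeled data rather than reading them from a hand-built table.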
SA is a multi-step process that includes data retrieval, data extraction,
data pre-processing, and feature extraction. The final stage, sentiment
classification, comes in three types: polarity classification, intensity
2 H. Chouikhi et al.
classification, and emotion identification. The first type classifies the text
as positive, negative, or neutral, while the second type identifies the degree
of polarity as very positive, positive, negative, or very negative. The third
type identifies the emotion, such as sadness, anger, or happiness.
In practice, the Arabic language has a complex nature due to its ambiguity and
rich morphological system. This complexity, combined with the variety of
dialects and the lack of resources, is a challenge for the progress of Arabic
sentiment analysis research.
In this paper, we address the tokenization challenges of sentiment analysis for
the Arabic language. We also tackle Arabic SA (ASA) by improving the
tokenization level. The rest of this paper is organized as follows. Section 2
presents the specificities of Arabic sentiment analysis. Section 3 reviews
existing work related to ASA. Section 4 describes our proposed method. Section
5 presents the experiments and results. Finally, we end with a conclusion.
Many studies in the literature have shown that sentiment analysis is not a
simple classification problem. SA is a suitcase research problem that requires
tackling different NLP tasks, including subjectivity detection, aspect
extraction, word polarity disambiguation, and time expression recognition.
Besides the general challenges of sentiment analysis, such as domain
dependency, polarity fuzziness, and spam [2], there are others specific to
Arabic SA. As sentiment analysis depends significantly on the morphology of the
target language, Abdul-Mageed et al. [3] listed the linguistic properties of
the Arabic language in terms of varieties, orthography, and morphology.
In terms of language varieties, Arabic is one of the six official languages of
the United Nations and the mother tongue of about 300 million people in 22
different countries. It includes standard Arabic and dialects. Modern Standard
Arabic (MSA) is the formal language of communication understood by the
majority of Arabic-speaking people, as it is commonly used in radio,
newspapers, and television.
The Arabic language is known for its morphological complexity and richness. A
single word may carry important information through prefixes, suffixes, and
other affixes [4]. An Arabic word reveals several morphological aspects,
including derivation, inflection, and agglutination.
A significant factor in an accurate sentiment analysis system is the use of
large annotated corpora. Accuracy increases with the quality and size of the
corpus used to train the sentiment classifier. The Arabic language is still
poor in terms of test corpora, which is a well-known problem for sentiment
analysis. In addition, the few available datasets are dialectally limited, or
even free of dialectal content. To the best of our knowledge, there is no
Arabic corpus annotated for sentiment analysis that fully covers the different
dialects.
MSA lexica are small compared to English lexica. Accordingly, many works try
to translate large English lexica into Arabic. However, the resulting coverage is poor
Arabic Sentiment Analysis using BERT model 3
hULMonA. They fine-tuned the multilingual BERT (mBERT) ULM for ASA. They also
collected a benchmark dataset for evaluating ULMs on sentiment analysis.
Antoun et al. [15] developed an Arabic language representation model to
improve the state of the art in several Arabic NLU tasks. They created AraBERT
based on the BERT model. They used the BERT base configuration, which has 12
encoder blocks, a hidden size of 768, 12 attention heads, and a maximum
sequence length of 512.
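The BERT base sizes quoted above can be written down as a small configuration object. This is only an illustrative sketch: the class name `EncoderConfig` and its field names are assumptions made here, not the API of any particular library. The derived `head_size` follows from the usual multi-head attention convention of splitting the hidden vector evenly across heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Illustrative container for the BERT base sizes quoted in the text."""
    num_encoder_blocks: int = 12
    hidden_size: int = 768
    num_attention_heads: int = 12
    max_sequence_length: int = 512

    @property
    def head_size(self) -> int:
        # Multi-head attention splits the hidden vector evenly across heads.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads


BERT_BASE = EncoderConfig()
print(BERT_BASE.head_size)  # 768 / 12 = 64
```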
Although word embedding is one of the main steps in any language processing
pipeline, only a few recent studies have attempted to evaluate word embeddings
of Arabic texts. Mohamed A. Zahran [16] translated the word2vec English
benchmark and used it to evaluate different embedding techniques on a large
Arabic corpus. However, they reported that translating an English benchmark is
not a good strategy for evaluating Arabic embeddings.
In this paper, we used an Arabic version of the BERT model: Arabic BERT [13],
which is trained from scratch and made publicly available. Arabic BERT is a
set of four BERT language models of different sizes, trained using masked
language modeling with whole word masking. The large, base, medium, and mini
models [13] were trained on the same data for 4M steps (Table 1).
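As a rough illustration of whole word masking, the sketch below masks every sub-word piece of a selected word together, so the model never sees part of a masked word. It assumes the common convention that a piece starting with `##` continues the previous word. The function name, the 15% default rate, and the English toy tokens are assumptions for illustration, and the sketch leaves out details of real pre-training pipelines, such as occasionally keeping or randomly replacing a selected token.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Mask whole words: a piece starting with '##' continues the previous word."""
    # Group token indices so that each group covers one full word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    rng = random.Random(seed)  # seeded for reproducibility
    masked = list(tokens)
    for word in words:
        # One random draw per word, applied to all of its pieces.
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked


tokens = ["play", "##ing", "foot", "##ball", "today"]
print(whole_word_mask(tokens, mask_prob=0.5, seed=1))
```

The key property is that the pieces of one word are always masked or kept together, whatever the random draw.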
The BERT tokenizer [18] was trained using WordPiece tokenization, which means
that a word can be broken down into more than one sub-word. The vector
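As a rough illustration of how one word becomes several sub-words, the sketch below applies the greedy longest-match-first rule commonly used when applying a WordPiece vocabulary. The toy vocabulary is hypothetical and uses English pieces for readability; it is not the vocabulary of Arabic BERT, and this sketch covers only the splitting step, not how the vocabulary is learned.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: treat the word as unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"un", "##break", "##able", "break", "##s"}
print(wordpiece_tokenize("unbreakable", TOY_VOCAB))
```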
4 Proposed method
Among all cited works, the approach of Ali Safaya [13] is the closest to our
approach. Figure 4 depicts the proposed architecture for Arabic SA. Our
architecture is composed of three blocks. The first block is the text
pre-processing step, where we use an Arabic BERT tokenizer to split each word
into tokens. The second block is the training model: the Arabic BERT model is
used with only 8 encoder layers (the Medium case [13]). The outputs of the
last four hidden layers are concatenated to get a representation vector of
size 512x4x128, with a batch size of 16 (32 for the AJGT dataset). The pooling
operation’s output is concatenated and flattened, then passed through a dense
layer and a Softmax function to get the final label. The third block is the
classifier, where we use a dropout layer for regularization and a
fully-connected layer for the output. The choice of maximum token length is
validated by a test with the AJGT dataset (see Fig. 5).
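The data flow of the second and third blocks can be sketched in plain Python with tiny stand-in sizes. This is not the authors' implementation: the text does not specify the pooling operation, so mean pooling over tokens is assumed here. The weights, the two-class output, and all function names are made up for illustration, and dropout is omitted because it only acts during training.

```python
import math


def concat_last_four(hidden_states):
    """hidden_states: list of layers, each a list of per-token vectors.
    Returns one vector per token: the last four layers joined feature-wise."""
    last_four = hidden_states[-4:]
    num_tokens = len(last_four[0])
    return [
        [value for layer in last_four for value in layer[t]]
        for t in range(num_tokens)
    ]


def mean_pool(token_vectors):
    """Average token vectors into one sentence vector (assumed pooling)."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def dense(vector, weights, bias):
    """Fully-connected layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Stand-in sizes: 8 layers, 3 tokens, hidden size 2
# (the text uses 8 layers, 128 tokens, hidden size 512).
num_layers, num_tokens, hidden = 8, 3, 2
hidden_states = [
    [[float(layer_idx + t)] * hidden for t in range(num_tokens)]
    for layer_idx in range(num_layers)
]
features = mean_pool(concat_last_four(hidden_states))  # length 4 * hidden
weights = [[0.01] * len(features), [-0.01] * len(features)]  # made-up weights
probs = softmax(dense(features, weights, [0.0, 0.0]))
print(len(features), probs)
```

With the sizes given in the text, each token would carry 4 x 512 = 2048 concatenated features before pooling.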
[Figure: accuracy (%) on the AJGT dataset versus maximum token length (32, 64,
128, 256); data labels: 96.11, 95.55, 94.44, 94.44]
Fig. 5: Optimal value of maximum token length
Table 2: Hyper-parameters used in the approach.
Hyper-parameters  Value
Batch-size        16 (32 for AJGT)
dropout           0.1
Max length        128
Hidden size       512
lr                2e-5
Optimizer         AdamW
Epochs            10/20/50
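For reference, the values listed in Table 2 can be collected into a small configuration mapping. The key names and the helper `batch_size_for` are assumptions made here for illustration. The table does not say which of the three epoch counts applies to which experiment, so all three are kept as listed.

```python
# Hyper-parameters as listed in Table 2 (key names are illustrative).
HYPERPARAMS = {
    "batch_size": 16,         # 32 for AJGT, see batch_size_for()
    "dropout": 0.1,
    "max_length": 128,        # maximum token length
    "hidden_size": 512,
    "learning_rate": 2e-5,
    "optimizer": "AdamW",
    "epochs": (10, 20, 50),   # the table lists three settings
}


def batch_size_for(dataset_name):
    """Return the batch size for a dataset, with the AJGT exception."""
    if dataset_name.upper() == "AJGT":
        return 32
    return HYPERPARAMS["batch_size"]


print(batch_size_for("AJGT"), batch_size_for("ASTD"))
```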
Table 3: Differences between the proposed approach, AraBERT [15] and Arabic
BERT [13] models.
                  Batch-size  Epochs    Layers  Activation function
Our approach      16/32       10/20/50  8       Softmax
Arabic BERT [13]  16/32       10        12      ReLU
AraBERT [15]      512/128     27        12      Softmax
[Figure: bar chart of accuracy (%) on the ASTD dataset for the compared approaches]
Fig. 6: Comparison between classical and deep learning approaches with ASTD
datasets
6 Conclusion
This paper proposes a BERT-based approach to sentiment analysis in Arabic.
This study shows that Arabic Sentiment Analysis (ASA) has become a research
area that has drawn the attention of many researchers.
Numerical results show that our approach outperforms the existing ASA
approaches. Many challenges still need to be resolved in order to design an
effective and mature sentiment analysis system. Most of these challenges are
inherited from the nature of the Arabic language itself. In future work, we
will try to overcome these challenges.
References
1. Mauro Dragoni, Soujanya Poria, Erik Cambria. (2018) "OntoSenticNet: A
commonsense ontology for sentiment analysis". IEEE Intelligent Systems
33(3):77-85.
2. Oumaima Oueslati, Erik Cambria, Moez Ben HajHmida, Habib Ounelli. (2020)
"A review of sentiment analysis research in Arabic language". Future
Generation Computer Systems 112:408-430.
3. Muhammad Abdul-Mageed, Mona Diab, Mohammed Korayem. (2011) "Subjectivity
and sentiment analysis of modern standard Arabic". Proceedings of the 49th
Annual Meeting of the Association for Computational Linguistics: Human
Language Technologies: short papers, Volume 2. Association for Computational
Linguistics. 587-591.
4. Amira Shoukry, Ahmed Rafea. (2012) "Sentence-level Arabic sentiment
analysis". 2012 International Conference on Collaboration Technologies and
Systems (CTS). IEEE. 546-550.
5. Wajdi Zaghouani. (2017) "Critical survey of the freely available Arabic
corpora". https://arxiv.org/abs/1702.07835.
6. Azhar Imran, Muhammad Faiyaz, Faheem Akhtar. (2018) "An enhanced approach
for quantitative prediction of personality in facebook posts". International
Journal of Education and Management Engineering (IJEME) 8(2):8-19.