Applied Soft Computing 107 (2021) 107373
https://doi.org/10.1016/j.asoc.2021.107373

Deep learning and multilingual sentiment analysis on social media data: An overview

Marvin M. Agüero-Torales a,∗, José I. Abreu Salas b, Antonio G. López-Herrera a

a Department of Computer Science and Artificial Intelligence, University of Granada, Calle Daniel Saucedo Aranda, s/n, 18071, Granada, Spain
b University Institute for Computing Research, University of Alicante, Carretera de San Vicente del Raspeig s/n, Alicante, Valencia, Spain

Article history: Received 12 June 2020; Received in revised form 24 February 2021; Accepted 25 March 2021; Available online 1 April 2021.
MSC: 68-02; 68T50; 68T07; 91D30.
Keywords: Sentiment analysis; Multilingual; Cross-lingual; Code-switching; Deep learning; Natural language processing (NLP); Social media.

Abstract: Twenty-four studies on twenty-three distinct languages and eleven social media platforms illustrate the steady interest in deep learning approaches for multilingual sentiment analysis of social media. We improve over previous reviews with wider coverage, from 2017 to 2020, as well as a study focused on the underlying ideas and commonalities behind the different solutions to achieve multilingual sentiment analysis. Interesting findings of our research are (i) the shift of research interest to cross-lingual and code-switching approaches, (ii) the apparent stagnation of the less complex architectures derived from a backbone featuring an embedding layer, a feature extractor based on a single CNN or LSTM, and a classifier, (iii) the lack of approaches tackling multilingual aspect-based sentiment analysis through deep learning, and, surprisingly, (iv) the lack of more complex architectures such as transformer-based ones, despite results suggesting that more difficult tasks require more elaborate architectures.

© 2021 Elsevier B.V. All rights reserved.

1. Introduction

Sentiment Analysis (SA) allows us to automatically evaluate people's opinions toward products, services, and other entities. This knowledge can help to make better decisions aimed at improving key performance indicators. Besides, the massive adoption of social media such as Facebook and Twitter, of platforms for e-commerce and services like Amazon, and even of review-specialized sites such as Rotten Tomatoes, has unleashed a vast amount of content to be analyzed. This data is naturally multilingual and multicultural; thus, an analysis based on a single language may carry the risk of not capturing the overall insights [1]. Moreover, important challenges can prevent fully leveraging this data. First, except for a few cases, e.g., English, most languages lack the well-maintained resources widely used for SA, such as annotated corpora and lexicons. Second, it may not be straightforward to adapt the same SA model to different languages, for example, due to variations in word order or usage, or to the noise introduced by machine translation. Finally, there is code-switching content, where users express their opinions using a mixture of languages in the same sentence.

Multilingual Sentiment Analysis (MSA) attempts to address those issues through several strategies: for example, taking advantage of resource-rich languages to perform SA in a resource-poor language, as is characteristic of cross-lingual sentiment analysis, or developing language-independent models capable of handling SA in different languages or in a code-switching setup. There is a wide spectrum of approaches for SA, for example [2–5], which can rely on supervised but also on unsupervised methods that exploit sentiment lexicons, grammatical analysis, and syntactic patterns. In Sections 2.1 and 2.3 we give a panoramic view of the different formulations of this task as well as of the evolution of SA and MSA. More recently, deep learning (DL) approaches have become a trend leading to state-of-the-art results, with authors such as [6–8] exploring Convolutional Neural Networks, Adversarial Networks, and Recurrent Neural Networks, among other models. In Section 2.2 we summarize some of the advances of deep learning for SA as an introduction to the main topic of this work, the applications of deep learning to multilingual sentiment analysis on social media.

∗ Corresponding author. E-mail addresses: maguero@correo.ugr.es (M.M. Agüero-Torales), ji.abreu@ua.es (J.I. Abreu Salas), lopez-herrera@decsai.ugr.es (A.G. López-Herrera).


Using the methodology detailed in Section 3 as a guideline, we curated and reviewed 24 relevant research papers. We categorized them according to their main idea into multilingual, cross-lingual, or code-switching approaches, covered in Sections 4.1, 4.2, and 4.3, respectively. For each one, we discuss its distinctive contributions, the experimental setup, the corpora, and the main results. Also, Section 4.4 includes a comparison that allows a quick and broad view of the advances in the domain. This analysis drew interesting conclusions, such as the few works, to date, leveraging recent developments in contextual embeddings. Other main findings and conclusions are covered in Sections 5 and 6.

As sentiment analysis and deep learning have grown into an important research field, there have been early efforts [1,9–13] to systematize the knowledge corpus in this domain, works extended lately by [14–16]. Recently, [17] examined the fundamentals of the multilingual case. However, more than seventeen works we identified introducing or exploring specific ideas for MSA have not been studied by these previous reviews. Another of our main contributions is to drive the review by the underlying hypothesis of each work, not only analyzing them as regards the type of neural network they used. This is important since the same task can be tackled by very different ideas. Also, we focused the analysis on the current three major strategies for MSA: multilingual, cross-lingual, and code-switching. This high-level view of the domain can help to unveil interesting patterns beyond the type of neural network implemented, for example, the use of adversarial training to learn language-agnostic features.

2. Preliminaries

In this section, we cover the fundamental concepts of Sentiment Analysis and Multilingual Sentiment Analysis, as well as some of the antecedents of the application of Deep Learning to this problem.

2.1. Sentiment analysis on social media

Starting from the work of Wiebe et al. [18] in the late 90s, there has been a surge of interest in the different setups of SA. In general, it can be done at the document, sentence, or aspect level [5], with classification in terms of positive, negative, or neutral, but also on more fine-grained scales such as a rating from 1 to 5. This attention to SA is closely tied to Social Media and its key role in the rise of modern SA, particularly with the works of Pang et al. [19] and Turney [20] in 2002. The first used machine learning (ML) classification techniques over movie review data, outperforming human-produced baselines. The second achieved an average accuracy of 74% for recommendations based on online reviews, using Semantic Orientation (SO) applied to unsupervised classification. Later, Pang and Lee (2008) [2] focused on the fundamentals and basic applications of SA, with a list of resources such as lexicons and datasets.

A comprehensive review that shows the maturity of SA up to 2012 can be found in the book of Liu [4]. This work covers most of the topics, definitions, research problems (e.g., opinion spam detection), types of opinions (such as explicit and implicit opinions), and classification algorithms for SA. In 2013, Feldman [21] and Cambria et al. [22] wrote about the basic techniques, key tasks, and applications, as well as the evolution of the field.

Another source to take the pulse of the continuous advances in the field has been the tasks related to SA on Twitter hosted by the International Workshop on Semantic Evaluation (SemEval) from 2013 to 2017, and in 2020 for code-switching text. From the latest results, we can corroborate a shift toward the application of deep learning, with 20 out of 48 participating systems in SemEval 2017 [23]. In the next section, we overview some of the recent advances in DL applied to SA without considering the multilingual task.

2.2. Deep learning on the sentiment analysis task

Deep Sentiment Analysis (DSA) relies on the great potential that DL has shown for NLP tasks. Here, we briefly comment on some examples to illustrate how DL has been leveraged within SA. Word embeddings are used for language modeling and feature learning. They are commonly used as the input of DL models, with Word2Vec [24] and GloVe [25] being two frequently used approaches. There are also contextualized embeddings such as ELMo [26], which better represent the polysemy of words. Besides using pre-trained embeddings, they can be learned to encode some task-specific semantics. In the context of SA, this approach has been explored in works such as [27] and [28].

Another trending field within DL is the attention mechanism, which allows the model to non-uniformly weigh the contribution of the context when computing the output. This is another choice that is frequently used in SA, for example, to capture the interaction between aspects and their context, as in [29,30].

Also, there has been great interest in novel architectures for SA. One approach that has received considerable attention when working at the document level is the design of hierarchical models, which learn a representation for sentences from their words; on top of this level, another model can learn representations for documents. Different alternatives, such as Convolutional Neural Networks (CNN) or Long Short-Term Memory (LSTM) networks, can be used at each level. The works in [31,32] and [33] are some examples of this approach.

As a last additional example of the ideas that have been explored, not only for monolingual SA but also for MSA, we mention the use of adversarial learning to produce a set of domain-independent features. This is the hypothesis of works such as [34] and [35] for cross-domain SA.

The concepts discussed in this section do not exhaust the applications of DL to SA; a more complete revision can be found in [36].

2.3. Multilingual sentiment analysis

In this section, we introduce the fundamentals of multilingual sentiment analysis as well as some of the earlier applications of DL to MSA. Initially, SA applications were developed basically for one language, English in most cases, but the multilingual nature of Social Media has shifted the field toward a multilingual analysis. Also, advances in SA backed by DL have made it possible to include low-resource languages and to avoid the use of translation tools.

A frequent approach to MSA is so-called Cross-Language (also Cross-Lingual) Sentiment Classification [37], which relies on machine translation [38,39]. For example, [40] reported an improvement over non-DL classifiers (Support Vector Machine, SVM) by translating from other languages (Hindi, Marathi, Russian, Dutch, French, German, Portuguese, Spanish, and Italian) to English in order to use augmented word embeddings together with a CNN model. However, the cross-language approach carries several issues and weaknesses, for example, notable discrepancies in the data distribution, potential cultural distances even under a perfect translation, and hard and costly translation of large corpora, with issues such as charges, availability, and performance [37].

Another usual setup for SA is the code-switching one, also called code-mixing or code-mixed. In this case, the content to be analyzed is expressed alternating two or more languages, even within a single sentence. One early approach that takes on this problem with DL is Wang et al. [41] for Chinese–English. They proposed a bilingual attention LSTM to perform SA on a corpus from Weibo.com, capturing the informative words from both the bilingual and monolingual contexts. Code-mixing is frequent in social media
sites such as Facebook and Twitter in countries where a large part of the population speaks more than one language, for example India, where several official and non-official languages are used. This is an active research area for Hindi–English. An early proposal [42] is based on an MLP (Multi-Layer Perceptron) with word-level features for Hindi–English and Bengali–English Facebook posts. A more detailed summary of SA for Indian languages, with a special focus on code-mixed text, can be found in [43].

There has been a lot of interest in systematizing the advances in MSA, for example [44] and [45]; the latter reviewed the principal directions of research, focusing on the development of resources and tools for multilingual subjectivity and SA, and addressed both multilingual and cross-lingual methods. Singhal & Bhattacharyya (2016) [11] described some of the different approaches used in SA research. Lo et al. (2017) [1] revised several of the main approaches and tools used for MSA at the time. They identified challenges and provided several recommendations, together with a framework for dealing with scarce-resource languages. Also, in [13] and [46] we can find revisions of the field; however, they did not delve into the use of DL in MSA.

3. Methodology

Our research methodology comprises four steps to realize our main goal: to identify the underlying ideas and commonalities behind the different solutions to achieve multilingual sentiment analysis, as well as to suggest future research directions.

(i) Define the research scope: the applications of deep learning to multilingual sentiment analysis on social media from January 2017 to December 2020. We chose this timeframe because the shift toward the application of DL-based sentiment analysis happened in 2017, as shown in [23].

(ii) Article search: our search terms were (a) deep learning AND (b) sentiment analysis AND (c.1) multilingual OR multi-language OR multilanguage; (c.2) crosslingual OR cross-lingual OR cross-language OR crosslanguage; (c.3) code-mixed OR codemixed OR code-mixing OR codemixing OR code-switching OR codeswitching; (c.4) bilingual OR bi-lingual. Search queries were run in Scopus and the Web of Science.

(iii) Article verification: the search yielded 96 studies, which were examined to ensure they satisfy the following criterion: (a) they must handle multilingualism explicitly, either by (a.1) training with one or more languages and evaluating with a different one or others, or (a.2) training with a multilingual corpus, i.e., the same model sees text in different languages during training, regardless of this happening at different steps. Thus, we excluded works that separately trained and evaluated the same architecture in different languages, i.e., that created one model per language trained only with data from the given language. Candidates for deletion were verified by the three authors. We also revised the citations of the selected studies, as well as those referenced by previous reviews [1,12,14–17,46–49], to identify possible candidates, applying the filters (a.1, a.2). In total, 24 publications were eligible for review.

(iv) Research analysis: for each of the selected papers we extracted data and information about (a) research characteristics, such as authors, year of publication, languages covered, methodology, and corpus characteristics; (b) sentiment level and categorization (binary, ternary, or fine-grained, e.g., ratings 0–5); (c) deep learning architectures and techniques; and (d) results and effectiveness of the proposal against baselines or state-of-the-art models. We also reviewed each work to identify the underlying idea used to handle multilingualism, with a focus on the results that assessed the hypothesis.

4. Deep learning techniques for multilingual sentiment analysis on social media

In this section, we review a large corpus of research related to the applications of deep learning to multilingual sentiment analysis. Instead of driving the analysis by the type of architecture or technique, we choose to organize the works by their sub-domain within MSA, i.e., multilingual, cross-lingual, or code-switching approaches. Within each category, we proceed chronologically to track the evolution of the field, but separating the aspect-based studies since, in general, they lead to very specific architectures. Also, we aim to provide a high-level perspective considering the underlying hypothesis of each work. Finally, Section 4.4 helps the reader to take a glimpse of the domain as regards the models, baselines, corpora, core ideas, and languages covered. For clarity, and due to the variety of datasets and languages covered by each work, Table 1 illustrates the corpora and reports the number of tweets/sentences/documents used.

4.1. Multilingual approaches

This category groups a large set of works that aim to be language-agnostic with respect to the languages seen during training. Common goals are the design of systems capable of learning directly from multilingual unpaired content and of providing predictions regardless of the source language. Across the analysis, we will use acronyms or abbreviations of common domain concepts without their definition for the sake of space.

4.1.1. Sentence-based studies

Training the same model for different languages is explored by [50]. They fit a multilayered CNN in three phases. First, they learn word embeddings from a large corpus of 300 million unlabeled tweets in English, Italian, French, and German. The parameters are optimized further during the second stage, trying to infer weak labels derived from emoticons. Finally, they fine-tune the model using a corpus of annotated tweets. Experiments evaluated a model (FML-CNN) trained on all languages at once, a model fitted on a single language (SL-CNN), and other variations. The results showed that FML-CNN reaches slightly worse performance, about a 2.45% lower F1 score, compared to SL-CNN (67.79% for Italian). However, experiments suggest that FML-CNN handles code-mixed text better.

Another hypothesis is to exploit character-level embeddings to achieve language independence. In [37] and [51], the authors describe language-agnostic, translation-free architectures (Conv-Char-S, Conv-Char-R) for Twitter based on a CNN that can be trained on several languages at once. They evaluated their approach using tweets in English, Portuguese, Spanish, and German from the corpus in [52], achieving an F1 score above 72.2% [51] for the multilanguage setup. The slightly worse results with respect to some baselines such as LSTM-Emb [53] can be a trade-off, since the models have ≈ 90 times fewer parameters and use ≈ 4 times less memory.

The idea of multilanguage character embeddings is also explored by [54], but mapping each character to its UTF-8 integer code. The architecture (UniCNN) is similar to [37,51], placing a CNN after the embedding layer, with a fully connected classifier on top. They used a subset of the Twitter corpus in [52]. UniCNN achieved an accuracy ≥ 75.45% for all languages. Moreover, except for English, it outperformed models that require translation and/or tokenization, such as TransCNN (Word), a similar architecture that operates at the word level on translated text (79.57% for English).
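To make this recurring embedding/feature-extractor/classifier backbone concrete, the following is a minimal PyTorch sketch of a UniCNN-style character-level classifier in the spirit of [54]: characters are mapped to their UTF-8 codes, embedded, passed through a single convolutional feature extractor, and classified by a fully connected layer. This is an illustrative reconstruction rather than the authors' code; all layer sizes and the byte-level encoding are assumptions.

```python
# Minimal character-level CNN sentiment classifier (illustrative sketch).
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, vocab_size=256, emb_dim=32, n_filters=64,
                 kernel_size=5, n_classes=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size, padding=2)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, char_ids):                # char_ids: (batch, seq_len)
        x = self.emb(char_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        x = torch.relu(self.conv(x))            # (batch, n_filters, seq_len)
        x = x.max(dim=2).values                 # global max pooling over time
        return self.fc(x)                       # class logits

def encode(text, max_len=140):
    """Map characters to UTF-8 byte values; 0 is reserved for padding."""
    ids = list(text.encode("utf-8"))[:max_len]
    return torch.tensor(ids + [0] * (max_len - len(ids)))

model = CharCNN()
batch = torch.stack([encode("great phone!"), encode("qué mal servicio")])
print(model(batch).shape)  # torch.Size([2, 3])
```

Because the same byte inventory covers every language, such a model needs no language identification or tokenizer, which is precisely the appeal of the character-level route discussed above.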


Table 1
Multilingual approaches used in DSA.
Proposed model* | Baseline model* | Proposed approach | C¹ | L² | Language | Corpus | Reference

CNN translate-single-lang-CNN, Trained large amounts of data in T D & M English, Italian, French, & Unlabeled tweets Deriu et al.
multi-lang-CNN, multi-lang- various languages (is trained for German (300M), (2017) [50]
without-identification-CNN, every single language), in three weakly-supervised data
Random Forest phases: unsupervised, distant (40 − 60M), and
supervised, and supervised with annotated tweets (71K )
multi-layer CNN

CNN LSTM-Emb, Conv-Emb, Cost-effective Character-based B D English, German, Annotated tweets Becker
Conv-Emb-Freeze, Conv-Char & embedding and optimized Portuguese & Spanish (128K , subset of 1.6M) et al.
SVM convolutions (2017) [51]

CNN LSTM-Emb, Conv-Emb, Character-level embeddings with B D & M English, German, Annotated tweets Wehrmann
Conv-Emb-Freeze, Conv-Char & few learnable parameters Portuguese & Spanish (128K , subset of 1.6M) et al.
SVM (2017) [37]

CNN word-Translation-CNN, Transformed characters into B D English, Polish, German, Annotated tweets Zhang et al.
char-Translation-CNN, 1-gram-SVM, numbers corresponding UTF-8 Slovak, Slovenian & (150K , subset of 1.6M) (2017) [54]
2-gram-SVM decimal codes Swedish

CNN SVM N-gram bilingual mixed B D & M French, English & Greek Labeled restaurant Medrouk &
(English–French) input text source balanced reviews (62.6 Pappa
(based on a Naïve approach) K) (2017) [55]

CNN word-CNN, char-CNN, Word-level & Character-level B D English, German, Annotated tweets Zhang et al.
unicode-char-CNN, CuDNNLSTM embeddings with two Portuguese, Spanish, Polish, (193K , subset of 1.6M) (2017) [56]
(word and char) convolutional channels (one Slovak, Slovenian &
channel for each) Swedish

BiLSTM Average Skip-gram Vectors with LR Shared parameters of twin T M English, Hindi English annotated Choudhary
& CNN-Subword-char-LSTM (siamese) networks with tweets (114K) & et al.
contrastive learning Hindi–English labeled (2018) [57]
sentences of Facebook
posts (3.8K )

CNN random initialized CNN, CNN + Cross-lingual graph-based B D English, Spanish, Dutch, Annotated movie Dong & De
GloVe/FastText/Polyglot propagation (transfer-learning) German, Russian, Italian, reviews (12.2K, Rotten Melo
embeddings, regular CNN from a rich source language with Czech, Japanese & French Tomatoes and (2018) [58]
concatenate standard embeddings of supervised training AlloCine), labeled
GloVe/FastText + multilingual on Amazon reviews to a reviews (20.8K ,
sentiment embeddings (VADER, dual-channel neural architecture TripAdvisor, and
SocialSent or Amazon reviews), Amazon Fine Food) &
dual-channel-CNN + GloVe/FastText labeled tweets (4.8K )
incorporates random initialized,
Polyglot, VADER, SocialSent or
static Amazon reviews embedding

CNN LSTM (one-layer and two-layer), Dictionaries of character and word T M Bambara & French (mixed) Labeled Facebook Konate &
BiLSTM (one-layer and two-layer), indexes to produce code-mixed comments (17K , subset Du (2018)
CNN-LSTM, NB & SVM character and word embedding for of 74K ) [59]
a single NN

MNB + LSTM Subword-LSTM Ensemble of Multinomial Naïve T M English, Hindi Hindi–English labeled Jhanwar &
Bayes with 1 and 2-gram features sentences from Das (2018)
and many-to-one stacked LSTM Facebook posts (3.8K) [60]
over 3-gram encoding of sentences

LSTM CNN N-gram raw corpus-based input, B D & M French, English & Greek Labeled restaurant and Medrouk &
without any preprocessing, hotel balanced reviews Pappa
translation, annotation nor (91.8K) (2018) [61]
additional knowledge features

BiLSTM Doc2Vec + SVM, FastText & CNN Embedding with only distributed T M English, Bengali, Hindi & Labeled sentences from Shalini
representation of the text Kannada (English mixed) Facebook comments et al.
(22.5K ) (2018) [62]

BiLSTM SVM Learning new word embeddings B D English & Greek Labeled TripAdvisor Stavridis
based on limited training datasets reviews (40K ) & et al.
and a pre-trained DNN exploiting annotated tweets (480) (2018) [63]
transfer-learning from a rich
source language with labeled data

CNN + GAN + SVM + Word2Vec, LSTM, CNN, NSC Combined CNN, GAN, and user B D English & Chinese Annotated tweets Wang et al.
Attention- + UPA, UPNN attention to learn specific and (48.1K ) & Weibo posts (2018) [64]
mechanism independent-language features (53.6K )
from data with authorship
information

GAN + DAN DAN, mSDA, Machine Translation + Combined Bilingual Embedding T, F D English, Chinese & Arab English reviews from Chen et al.
DAN, CLD-KCNN, CLDFA-KCNN (BWE), Deep Averaging Network Yelp (700K), hotels (2018) [65]
(DAN) and adversarial training to reviews in Chinese
learn independent-language (20K labeled / 150K
features from a source language unlabeled) & tweets in
(English) Arab (1.2K labeled)

BiLSTM-CNN CNN-hierarchical-BiLSTM, Word vector representation B & T D English, Arabic, French & Binary (400) and Liu et al.
CNN-hierarchical-BiLSTM-gate- improvement based on gate Chinese ternary (3.7K ) labeled (2019) [66]
mechanism, LSTM, CNN, mechanism, which obtains web reviews (4.1K )
LSTM-attention-mechanism, time-series relationship of different
CNN-attention-mechanism, sentences in the comments
RCNN-LSTM, hierarchical-LSTM, through an RCNN, and gets the
LSTM-sentences-relations local features of the specific
aspects in the sentence and the
long-distance dependence in the
whole comment through a
hierarchical attention BiLSTM

BiLSTM-CNN 1-grams + 2-grams-SVM, 1-grams Hybrid architecture with subword T M English, Hindi Hindi–English labeled Lal et al.
+ 2-grams-NB-SVM, 1-grams + level representations for the sentences of Facebook (2019) [67]
2-grams-MNB, Tf–Idf-MNB, Lexicon sentences, two parallel BiLSTM as posts (3.8K )
Lookup, Char-LSTM, Subword-LSTM, Dual Encoder (Collective Encoder
FastText & SACMT for overall sentiment and Specific
Encoder with attention mechanism
for subwords) and linguistic
features network

BiLSTM SVM-MONO, SVM-MT, Low-resource language embeddings F - 1 D English, Spanish & Catalan English labeled tweets Jabreel
ARTEXTE-SVM-based, + mapping function, joined with (33.7K ), Spanish et al.
ARTEXTE-Ensemble, rich-resource language embedding OpeNER (1.3K ) & (2019) [68]
BARISTA-SVM-based, through k-NN refinement, BiLSTM Catalan MultiBooked
BARISTA-Ensemble, BLSE, as encoder layer, then fully (1K )
BLSE-Ensemble & BiLSTM-MT connected layer with softmax for
prediction
Double LSTM Double LSTM with several Low resource + code-mixed corpus T M English & Hindi (mixed) Hindi–English labeled Mukherjee
combinations of optimizers and to train embeddings. Joint feature sentences of Facebook (2019) [69]
loss functions & Subword-LSTM of sentences (subword + word posts (3.8K )
levels), preceded by Double LSTM
layer
LSTM-AAE- MT-SVM, MT-BiGRU & TL-BiGRU Contextual word embeddings D English, Amazon labeled documents Shen et al. (2020) [70]
BiGRU (Word2Vec+LSTM, source and Chinese & (28.9K ) and unlabeled
target languages), AAE B German documents (80K ) for each
pair of language category
(books, DVD, music)
BiLSTM + NB, SMO (SVM), RF, BiLSTM-CNN, Attention mechanism to extract T M English, Hindi & Bengali English, Jamatia
Attention- Double BiLSTM, GloVe + such words that are important to (English mixed) Bengali–English, et al.
mechanism BiLSTM-CNN, GloVe + Double the meaning of the sentence and Hindi–English (2020) [71]
BiLSTM, GloVe + aggregate the representation of annotated tweets
Attention-based-BiLSTM, those informative words to form (9.2K , 5.5K , 18.4K ) &
BERTbase−uncased & the sentence vector; a sigmoid Hindi–English labeled
BERTbase−domain−uncased layer is used to predict the correct sentences of Facebook
label posts (3.8K )
BiLSTM LASER-CNN, FastText-BiLSTM & Transfer learning by LASER with F + 1 D Polish, Dutch, English, Online medicine, Kanclerz
fastText-CNN (low-resource) language corpus, French, German, Italian, hotels, school, products et al.
BiLSTM, then predict the sentiment Portuguese, Russian & reviews (8.4K for each (2020) [72]
of texts in other (high-resource) Spanish language)
language
CNN-BiLSTM NB, BiLSTM & Subword-LSTM Three stages classification with T M English & Kannada (mixed) Annotated YouTube Chundi
subword embeddings + comments (10.4K) et al.
CNN-BILSTM: first positive or not, (2020) [73]
then negative or not, and then,
computed classification matrix of
them
LSTM-CNN BilBOWA + CNN, VecMap + CNN, Train bilingual embeddings B D English & Persian Binary (11K , Persian Ghasemi
BilBOWA + LSTM, VecMap + LSTM, (VecMap, on one high-resource Digikala reviews), five et al.
BilBOWA + CNN-LSTM, VecMap + and other low-resource language) categories (200K , (2020) [74]
CNN-LSTM & BilBOWA + and uses it on target language English Amazon
LSTM-CNN (low-resource), followed by a DL reviews)
classifier for predict polarity

*Bold: model with best performance.


¹ Classification (C): B (Binary), T (Ternary), F (Five categories).
² Multilingualism Level (L): Document (D), Mix (M).

Multilanguage character embeddings are further developed in [56], but within an architecture (Word-Character CNN) that processes the text through two parallel CNNs, one for words and the other for characters. The hypothesis is that word and character features provide complementary information. The outputs from both CNNs are merged before being fed to a fully connected classifier. To achieve language independence, the embedding layer is kept trainable. They used the same Twitter corpus as in [54]. The hybrid model yields better performance (≥ 77.13%) compared to pure word/character CNNs such as [8] (≥ 74.64%), [37] (≥ 75.41%), and their former model UniCNN [54] (≥ 75.45%) for the languages already studied in [54]. Interestingly, the two Romance languages considered, Spanish (69.82%) and Portuguese (72.87%), had the worst performance.
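The two-channel hypothesis can be sketched as two parallel convolutional encoders, one over word indices and one over character indices, whose pooled outputs are concatenated before the classifier. The code below is only an illustration of this idea under assumed dimensions and a toy vocabulary, not a reproduction of the original Word-Character CNN.

```python
import torch
import torch.nn as nn

class TwoChannelCNN(nn.Module):
    """Parallel word-level and character-level CNN channels (sketch)."""
    def __init__(self, word_vocab=20000, char_vocab=256, n_classes=3):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, 100, padding_idx=0)
        self.char_emb = nn.Embedding(char_vocab, 32, padding_idx=0)
        self.word_conv = nn.Conv1d(100, 128, kernel_size=3, padding=1)
        self.char_conv = nn.Conv1d(32, 128, kernel_size=5, padding=2)
        self.fc = nn.Linear(128 + 128, n_classes)

    def _channel(self, ids, emb, conv):
        x = emb(ids).transpose(1, 2)        # (batch, emb_dim, length)
        x = torch.relu(conv(x))
        return x.max(dim=2).values          # pooled channel features

    def forward(self, word_ids, char_ids):
        w = self._channel(word_ids, self.word_emb, self.word_conv)
        c = self._channel(char_ids, self.char_emb, self.char_conv)
        return self.fc(torch.cat([w, c], dim=1))

model = TwoChannelCNN()
words = torch.randint(1, 20000, (4, 30))   # 4 sentences as word ids
chars = torch.randint(1, 256, (4, 140))    # same sentences as character ids
print(model(words, chars).shape)           # torch.Size([4, 3])
```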
Medrouk & Pappa [55] studied a similar architecture. It com- encoder is a regional CNN that aims to preserve the temporal
prises a stack of CNN working as a feature extractor, i.e., an relationship between different sentences, also capturing some of
encoder, followed by polling and a fully connected predictor, but the long-distance dependencies of the aspects. Both feature sets
in this case, working at the n-gram level. To this point, CNN are feed to a sentence-level BiLSTM together with an attention
seems a popular choice within the domain in contrast to LSTM. mechanism. A softmax classifier handles the output of the last
The model is feed with reviews written in French, English, and layer. In experiments using a subset of the dataset in [75] their
Greek without any language indication. Empirical evaluation over full model yields an F1 score above 78.04% in all cases outper-
a mix of contents from three languages yielded an F1 score of forming baselines such as the Hierarchical LSTM [31] (≥ 78.04%).
88%. These results reinforce the assumption that the n-gram CNN What is more important, they compared a version (CNN-HBLSTM)
can produce language-independent features capturing the local without the gate mechanism that achieves worsts results (≥
relations between words useful for multilingual polarity analysis. 74.66%) and the highest variance among languages.
Whether CNN and LSTM variants of the embedding-feature So far, we have reviewed the purely multilingual approaches
extractor–classifier architecture need extra pre-training hassle or for SA. We have contrasted very different approaches. However,
additional complexity to handle multilingual data is investigated at the sentence level, the common strategy is to learn features
by [61]. Experiments were conducted training monolingual and from a multilanguage set using CNN and feed a classifier module
multilingual models, achieving accuracy over 90% for both types with those features. Unsurprisingly, for the aspect SA setup, au-
of networks working at the n-gram level. Moreover, the fact that thors embrace more complex architectures leveraging attention
the multilingual models behave as well as the monolingual ones, mechanisms and aspect embeddings. The next section is devoted
seems to confirm the hypothesis about their ability to extract rich to the cross-lingual category.
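As an illustration of why aspect embeddings and attention fit together naturally at the aspect level, the sketch below scores each word representation against a learned aspect embedding and pools the sentence accordingly. It is a generic attention pattern under assumed dimensions, not the GRCNN-HBLSTM of [66].

```python
import torch
import torch.nn as nn

class AspectAttentionPooling(nn.Module):
    """Pool word features with attention conditioned on an aspect (sketch)."""
    def __init__(self, hidden=128, n_aspects=10, n_classes=3):
        super().__init__()
        self.aspect_emb = nn.Embedding(n_aspects, hidden)
        self.score = nn.Linear(2 * hidden, 1)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, word_feats, aspect_ids):
        # word_feats: (batch, seq_len, hidden), e.g. BiLSTM or CNN outputs
        a = self.aspect_emb(aspect_ids).unsqueeze(1).expand_as(word_feats)
        weights = torch.softmax(
            self.score(torch.cat([word_feats, a], dim=2)).squeeze(2), dim=1)
        sentence = (weights.unsqueeze(2) * word_feats).sum(dim=1)
        return self.fc(sentence)               # aspect-specific polarity logits

model = AspectAttentionPooling()
feats = torch.randn(2, 25, 128)                # stand-in encoder outputs
logits = model(feats, torch.tensor([3, 7]))    # two sentences, two aspects
print(logits.shape)                            # torch.Size([2, 3])
```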

4.2. Cross-lingual approaches

We categorized as cross-lingual the proposals whose focus is to leverage resource-rich language assets to extrapolate to a low-resource target, for example, through transfer learning. This is the core of the proposal in [58] for the cross-lingual projection of sentiment embeddings. Their custom architecture (Dual-Channel CNN) has one channel that works with word embeddings to extract features through convolutions. The other channel is similar but uses word sentiment embeddings, which can boost the classification. The features computed by each channel are merged before being fed to a fully connected layer. It is worth noting that, for low-resource languages, the sentiment embeddings can be projected from English. They evaluated their approach with English as the source and Spanish, Dutch, German, Russian, Italian, Czech, Japanese, and French as targets. The induced embeddings lead to better results in 7 out of the 10 trials, with accuracy over 79.3%.
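Projecting embeddings from one language's space into another's is often done with a linear map learned from a small seed dictionary; the orthogonal Procrustes solution below is one standard way to obtain such a map. This is a generic illustration of the projection idea with synthetic vectors, not the specific procedure of [58].

```python
import numpy as np

def procrustes_map(src_vecs, tgt_vecs):
    """Orthogonal W minimizing ||src @ W - tgt|| over a seed dictionary.

    src_vecs, tgt_vecs: (n_pairs, dim) embeddings of translation pairs.
    """
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt

# Toy example: random vectors stand in for real monolingual embeddings.
rng = np.random.default_rng(0)
dim, n_pairs = 50, 200
english = rng.normal(size=(n_pairs, dim))
rotation = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
spanish = english @ rotation                       # pretend aligned pairs

W = procrustes_map(english, spanish)
projected = english @ W                            # source vectors in target space
print(np.allclose(projected, spanish, atol=1e-6))  # True in this synthetic case
```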
While not common in cross-language SA, in [63] the authors explored, for transfer learning, the architecture comprising a feature extractor (BiLSTM) fed by embeddings and followed by a classifier (dense layer). First, they use a large dataset to train the whole model. Afterward, they fine-tune it on a small labeled dataset from the target language, with only the embedding layer remaining trainable. They trained using TripAdvisor reviews in English for the first stage and tweets in Greek for the second, with results very sensitive to the size of this dataset (accuracy drops from 91.7% to 73.2% as the dataset shrinks from 400 to 330 instances). As the authors note, it will be interesting to study how different degrees of syntactic similarity between languages influence the results.
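This transfer recipe, training the full network on the resource-rich language and then updating only the embedding layer on target data, can be expressed in a few lines; the model, sizes, and data below are placeholders assumed for illustration.

```python
import torch
import torch.nn as nn

class SentimentNet(nn.Module):
    def __init__(self, vocab=20000, emb=100, hidden=128, n_classes=3):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb, padding_idx=0)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, ids):
        h, _ = self.encoder(self.emb(ids))
        return self.fc(h.mean(dim=1))          # mean-pooled BiLSTM states

model = SentimentNet()
# ... assume the whole model was first trained on the large English corpus ...

# Second stage: freeze everything except the embedding layer.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("emb.")

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)

target_batch = torch.randint(1, 20000, (8, 40))   # stand-in target-language data
labels = torch.randint(0, 3, (8,))
loss = nn.functional.cross_entropy(model(target_batch), labels)
loss.backward()
optimizer.step()
print(loss.item())
```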
The next works reviewed within the cross-lingual category explored the idea of using adversarial training to learn a set of language-independent features.
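A common way to implement this idea is to attach a language discriminator to the shared feature extractor and reverse its gradient, so that features which help the sentiment classifier while confusing the language classifier are rewarded. The sketch below shows the gradient-reversal trick in isolation; it is a generic illustration, not the exact training scheme of any single work discussed below.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

features = nn.Sequential(nn.Linear(300, 128), nn.ReLU())   # shared extractor
sentiment_head = nn.Linear(128, 3)                         # sentiment classes
language_head = nn.Linear(128, 2)                          # e.g. English vs. Chinese

x = torch.randn(16, 300)                 # stand-in sentence representations
sent_y = torch.randint(0, 3, (16,))
lang_y = torch.randint(0, 2, (16,))

f = features(x)
loss = (nn.functional.cross_entropy(sentiment_head(f), sent_y)
        + nn.functional.cross_entropy(
            language_head(GradReverse.apply(f, 1.0)), lang_y))
loss.backward()   # extractor gradients favor language-indistinguishable features
print(loss.item())
```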
In [64] the authors delve into the synergies of microblog data in different languages from the same user to extract personalized, language-specific or language-independent features and alleviate the lack of data in some sources. The architecture has four components. First, an attention mechanism encodes users as feature vectors to propagate their individuality across the system. The second component is a pair of encoders [θ1, θ2], one for each language, computing language-specific features. The third element is the language-independent encoder θG, which is fed with sentences from both languages as well as the user attention vector. The encoders are CNNs over different n-gram representations of the sentences, plus an attention mechanism for the user-specific information. The classifier is a softmax layer for each language with inputs from θG and θ1 or θ2. The last module is a Generative Adversarial Network (GAN) that drives θG toward a set of features useful for SA when combined with θ1 or θ2 but uncorrelated with the language of the input sentence. Experiments with Twitter and Sina Weibo compared monolingual baseline models against the proposal trained with both languages at once. Results show an increase of the F1 score of up to 2.12% with respect to the best monolingual model (≥ 79.85%).

Another work that leverages adversarial training to learn a set of language-independent but highly discriminative features is [65]. The architecture (ADAN) uses the Deep Averaging Network (DAN) of [76] as a feature extractor, with a Bilingual Word Embedding (BWE) [77] as the input layer. These features are fed to the classifier and to a language discriminator acting as an adversary, driving the DAN toward language-independent features. Empirical evaluation shows that ADAN (accuracy ≥ 42.49%) improves by at least 6% over a version without the adversarial module (only DAN) trained with English to predict Chinese and Arabic. This suggests that the adversarial mechanism is crucial for the results.

The adversarial mechanism is also critical in [70]. They build a cross-lingual word embedding using an Adversarial Auto-Encoder (AAE) [78] fed from the outputs of two LSTMs, one for each of the source and target languages. On top of this module sits a classifier based on a Bidirectional Gated Recurrent Unit (BiGRU). Evaluating on Amazon comments with English as the source and Chinese and German as targets, the model (TL-AAE-BiGRU) achieved an F1 score ≥ 78.13%, about 3.44% better on average than the model without the AAE (≥ 73.25%). This result is consistent with [64,65], who also noticed the benefits of the adversarial module.

Instead of adversarial training, in [68] the authors opted to create universal embeddings by combining embeddings from high- and low-resource languages. The Universal SA model (UniSent) involves the pre-training of two BiLSTMs on English (labeled tweets), the alignment of the low-resource language embeddings to the English ones with an unsupervised, domain-adversarial approach (MUSE [79]), and fine-tuning on the low-resource languages' validation sets applying a Universal Embedding Layer. This layer represents a word in a low-resource language by the weighted average of the most similar words to it in the English word embedding. The embeddings are fed to the classifier, a many-to-one LSTM layer. The experiments were carried out on texts from OpeNER for Spanish and MultiBooked for Catalan. UniSent achieved an F1 score ≥ 81.4% for the binary classification and ≥ 54.2% for the fine-grained one, outperforming even a version tested on translated texts (≥ 74.0% and ≥ 40.6%).

A simpler architecture comprising the LASER (Language-Agnostic SEntence Representations) toolkit (https://github.com/facebookresearch/LASER) as a language-independent embedding, a feature extractor (CNN or BiLSTM), and a classifier is proposed by [72]. The multilingual embedding aims to drive the model toward language-agnostic representations, making it able to perform SA in languages different from the ones seen during training. Experiments training with Polish (F1 score 79.91%) to predict other languages seem to support this premise, since in all cases the F1 score was ≥ 77.96%; also, the LASER+BiLSTM setup is better in general.

A similar architecture is proposed by [74], but using Bilingual Bag-of-Words without Word Alignments (BilBOWA) and VecMap (https://github.com/artetxem/vecmap) as cross-lingual embeddings. Experiments over English and Persian electronic product reviews evaluated different alternatives for the embeddings and the feature extractors (CNN, LSTM, CNN-LSTM, and LSTM-CNN). VecMap+LSTM-CNN with dynamic embeddings, i.e., fine-tuning the embeddings with the training data, achieved the best results, with an F1 score of 91.82%.

So far, we have reviewed cross-lingual approaches for MSA. Though there are very different perspectives to solve the problem, they also share some common ideas, such as the projection of resources like sentiment embeddings. The next section is committed to reviewing the works that addressed the code-switching setup.

4.3. Code-switching approaches

In this section, we examine the reports that tackled code-switching sentiment analysis. This setup poses challenges such as spelling variations, transliteration, informal grammar forms, and the scarcity of annotated data.

The underlying idea of [57] is to learn a sentiment feature space preserving the similarity of sentences in terms of the sentiment they convey. This makes it possible to measure the relatedness between code-switched content and labeled data from a resource-rich corpus. The Sentiment Analysis of Code-Mixed Text (SACMT) architecture uses a siamese BiLSTM with tri-gram embeddings as input and a fully connected layer on top. They compared a
model trained only with pairs of code-mixed text (Hindi–English), reaching an F1 score of 67.2%, to another trained with pairs consisting of one English sentence and one code-mixed sentence (75.9%). These results suggest that, in effect, the model benefits from the additional data provided by the English corpus.
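A minimal version of such a siamese objective encodes two sentences with the same shared-weight encoder and applies a contrastive loss that pulls same-sentiment pairs together and pushes different-sentiment pairs apart. The encoder, dimensions, and margin below are assumptions for illustration; this is not the SACMT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoder(nn.Module):
    def __init__(self, vocab=5000, emb=64, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb, padding_idx=0)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)

    def forward(self, ids):
        out, _ = self.lstm(self.emb(ids))
        return out.mean(dim=1)              # sentence vector, (batch, 2*hidden)

def contrastive_loss(v1, v2, same_label, margin=1.0):
    """same_label = 1 if both sentences convey the same sentiment, else 0."""
    dist = F.pairwise_distance(v1, v2)
    return (same_label * dist.pow(2)
            + (1 - same_label) * F.relu(margin - dist).pow(2)).mean()

encoder = SharedEncoder()                   # same weights for both branches
a = torch.randint(1, 5000, (8, 20))         # e.g. English sentences
b = torch.randint(1, 5000, (8, 20))         # e.g. code-mixed sentences
same = torch.randint(0, 2, (8,)).float()

loss = contrastive_loss(encoder(a), encoder(b), same)
loss.backward()
print(loss.item())
```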
Most of the research on code-mixed text has focused on the English–Hindi setup. One exception is the seminal work on code-mixed Bambara–French Facebook comments in [59]. They examined different variations of the base architecture, with embeddings (at the word or character level) as input, followed by a feature extractor (one of LSTM, BiLSTM, CNN, CNN-LSTM) and, finally, the classifier. To mitigate the lack of pre-trained embeddings for Bambara, the model learns multilanguage embeddings from the characters or words of the code-mixed corpus. The best performing model was a one-layer CNN with an accuracy of 83.23%. The comparison between LSTM and CNN as feature extractors, where the latter yields better results, is coherent with a noticeable preference for this type of model within the domain.

One problem when working with code-mixed data is the noise and the small size of the datasets. To alleviate this, in [60] the authors propose to use n-gram embeddings instead of the subword ones suggested by [80]. Another novel idea within the domain is their model that works as an ensemble of a Multinomial Naïve Bayes (MNB) classifier and a recurrent neural network (LSTM or BiLSTM) classifier. They trained the MNB using word-based 1- and 2-gram features, while the neural network models use character 3-grams. The results, together with those reported by [80], suggest that for the LSTM network the 3-gram representation is a better option (F1 score of 58.6%) than characters (51.1%). The subword-level encoding achieved 65.2%, but the differences between the architectures can mislead the conclusions. The values for the ensemble (66.1%) show that it can be of benefit.
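The ensembling idea can be sketched by averaging the class probabilities of a Multinomial Naïve Bayes model trained on character 3-grams with those of any neural classifier. scikit-learn is assumed for the n-gram counting and the MNB; the neural probabilities are a hard-coded placeholder here, and the texts and labels are toy values.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["bahut acha phone hai", "worst service ever", "movie was theek thaak"]
labels = [1, 0, 1]                      # toy binary sentiment labels

# Multinomial NB over character 3-gram counts.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
X = vectorizer.fit_transform(texts)
mnb = MultinomialNB().fit(X, labels)

test = ["service theek hai"]
p_mnb = mnb.predict_proba(vectorizer.transform(test))

# Placeholder for the probabilities produced by a neural (e.g. LSTM) classifier.
p_neural = np.array([[0.30, 0.70]])

p_ensemble = (p_mnb + p_neural) / 2.0   # simple soft-voting ensemble
print(p_ensemble, p_ensemble.argmax(axis=1))
```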
The authors of [62] assessed different versions of the architecture that sequentially combines one feature extractor and a classifier. The first one is a document-to-vector (Doc2Vec) layer whose output is fed to an SVM. The other three featured a FastText classifier, a BiLSTM, and a CNN with a softmax; the last two had a trainable word embedding layer before the neural network. They curated a new corpus for Kannada–English (best results for the CNN model, with an accuracy of 71.5%). Also, they evaluated on two available Hindi–English and Bengali–English corpora [81], with the BiLSTM model achieving slightly better results, 60.22% and 72.2% respectively.
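Subword representations of the kind used by FastText and by several of the works below can be illustrated by averaging embeddings of hashed character n-grams; the hashing scheme, bucket count, and dimensions below are arbitrary assumptions in the spirit of fastText's subword units.

```python
import torch
import torch.nn as nn

class SubwordEmbedding(nn.Module):
    """Represent a word as the mean of hashed character n-gram embeddings."""
    def __init__(self, n_buckets=10000, dim=64, n=3):
        super().__init__()
        self.table = nn.Embedding(n_buckets, dim)
        self.n_buckets, self.n = n_buckets, n

    def ngram_ids(self, word):
        padded = f"<{word}>"
        grams = [padded[i:i + self.n] for i in range(len(padded) - self.n + 1)]
        # Python's hash() is enough for a sketch; a stable hash is used in practice.
        return torch.tensor([hash(g) % self.n_buckets for g in grams])

    def forward(self, word):
        return self.table(self.ngram_ids(word)).mean(dim=0)   # (dim,)

emb = SubwordEmbedding()
# Out-of-vocabulary or misspelled forms still share many n-grams.
for w in ["bahut", "bahuut", "awesome"]:
    print(w, emb(w)[:4])
```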
In [67] the authors delve into whether to use characters, words, or subwords, and also into whether projecting code-mixed text into a single feature space is a rich enough representation for SA. Their architecture (CMSA) combines three parallel feature encoders before a classifier of four dense layers. The collective encoder aims to represent the overall sentiment of the sentences; it is based on a BiLSTM network whose end states are the features. The specific encoder is also a BiLSTM, but in this case the intermediate states are also turned into features through an attention mechanism. Both encoders take as input the output of a subword-level CNN. The last component is a set of hand-crafted features that augment the information supplied to the model. They evaluated the effect of removing some of the components. CMSA achieved an F1 score of 82.7%, better than the model with only the specific encoder (80.1%) or only the collective one (79.5%). This seems to support the hypothesis about the synergies of the different representations.

Another model combining different feature extractors is the one proposed in [69]. Like [56], they have character-level and word-level feature encoders. The first one is inspired by [80], stacking a CNN followed by two LSTM layers; it aims to help with language independence, noise, and non-standard spelling. The other extractor stacks two LSTM layers. The concatenation of the two feature sets is the input for the classifier, a stack of two dense layers and a softmax. Their model achieved an F1 score of 66.13% over the Hindi–English corpus in [80], improving on the baseline in [80] by about 5.5%. As in [56], combining both types of feature extractors seems to lead to better results.

Recently, in [71] the authors evaluated several architectures (BiLSTM-CNN, Double BiLSTM, Attention-based), each one with and without GloVe, and BERT (Bidirectional Encoder Representations from Transformers) [82], over tweets and Facebook comments for English–Hindi and English–Bengali. The same models were trained on a monolingual corpus to observe the effects of code-mixing. The best model for code-mixing, the Attention-based model with custom word embeddings, achieved an average F1 score of 0.66, and 0.67 for the monolingual setup. It is interesting that the performance of BERT base-uncased, the best model for the monolingual setup with 0.77, decreased noticeably for the code-mixed one (about 0.63).

The work in [73] also relies on subword embeddings. The architecture (SAEKCS), similar to [80], includes a CNN layer on top of the embeddings to extract local dependencies. Its output is processed, after max-pooling, by a BiLSTM layer to learn long-term relations. On top, a fully connected layer acts as the classifier. They evaluated SAEKCS using Kannada–English code-switching YouTube comments, with an accuracy of 77.6%. They also assessed a subword LSTM (64.8%) and a BiLSTM (55.9%), suggesting that the short-term dependencies encoded by the CNN greatly benefit the model.

Finally, we have also reviewed code-switching approaches for MSA. The most trending proposal is to use subword embeddings to allow guessing the meaning of unknown or out-of-vocabulary words. We have contrasted very different language mixes, in the majority of cases involving English. The next section is an overview of the deep learning implementations across the different setups.

4.4. An overview of the different deep learning implementations

Up to this point, we have reviewed a comprehensive corpus of research works that have leveraged deep learning for multilingual sentiment analysis. We have focused on the underlying hypothesis of the proposed approaches, highlighting what is common or different, as well as on the results on the evaluation corpora. In this section, we aim to make it easier for the reader to get a quick overview of the material we analyzed.

Table 1 summarizes the works we reviewed, describing the core of the best model architecture (Proposed Model), the baselines, the distinctive ideas (Proposed Approach), the classification categories, the multilingualism level, the languages, and information about the corpus. The classification categories are mainly divided into positive or negative (binary), additionally with a neutral class (ternary), or five classes (sometimes authors remove the neutral class or add another one, e.g., ambivalent).

5. Discussion and future directions

In this section, we discuss the main findings of our study. We also highlight some unexplored topics that may hint at interesting directions for further research.

5.1. Languages and social media in MSA

The 24 studies we analyzed covered 23 different languages. In most cases English was the resource-rich language, except in [72] and [59]. However, authors have explored synergies between different languages, as shown in Fig. 1. Concerning the social
media platforms, Twitter and Miscellaneous (under this category we considered works such as [66], which used the SemEval 2016 Task 5 data, and other corpora that aggregated text from different sources) account for most of the works. Fig. 2 shows the data disaggregated by language and social media. The next subsection is devoted to the analysis of the relations between the MSA setup and the architecture proposed to deal with the problem.

Fig. 1. Synergies between languages. A link between two languages indicates that both have been used simultaneously in a model, as source or target language, or to learn multilingual feature spaces.

5.2. DL architectures for MSA

A comparison between the backbones of the different architectures suggests that, in general, for multilanguage sentence-level SA, authors have explored a plainer architecture. It leverages trainable embeddings preceding the feature extractor and, finally, a classifier layer. Regardless, there is a wide range of alternatives in the design of the classifier, from a single BiLSTM [63] to parallel CNNs [56]. Only one study focused on aspect-level SA, so it may be difficult to draw conclusions about the best architectural decisions to tackle this problem. However, experiments in [66] suggest that the attention mechanism as well as the aspect embedding play a key role. Moreover, results from another work [83] show that the mere addition of attention to a simpler model such as an LSTM or a CNN does not lead to state-of-the-art results. This is indicative of the convenience of combining feature extractors at different levels for aspect-based SA.

For cross-lingual SA, network designs are more diverse. However, two core ideas can be identified. One is the use of an adversarial module to drive the feature extractors toward language-independent representations, as in [64] or [65]. The other is leveraging trainable cross-lingual embeddings or even pre-trained ones such as Facebook's LASER [72].

Code-switching backbones tend to be simpler than the cross-lingual ones, yet more elaborate than the multilingual ones. They can include parallel encoders at the character, word, or subword level [67,69], or implement ensemble models including deep neural networks and other classification techniques [60].

In general, a lot of effort has been devoted to comparing feature extractors based on CNN, LSTM, or BiLSTM and different levels of embeddings. We analyzed the co-occurrences of the different types of networks, attention mechanisms, etc. within the same model, to visualize how authors have been using them. Edges in Fig. 3 mean that the two concepts at the nodes have been used together in a model, and the weight indicates how often this relationship has occurred. The graph shows that CNNs are preferred, along with hybrid architectures, where CNN + BiLSTM is the most studied combination, with [59] and [62] reporting comparisons between models where they only changed this setting, resulting in better performance for the CNN models. Instead, results were equivalent in [61] and better for LSTM in [51]. Thus, there seems to be no consensus on this subject.

5.2.1. Embedding approaches for MSA

Regardless of the domain, most authors rely on some embedding type, with a trend toward learning the embedding along with the training process. There is no common opinion about whether to work at the character, word, or subword level. Authors such as [51] advocate character-level embeddings due to their simplicity; however, they reported slightly better results for the word-level case, while [37] shows that the character level could help achieve language independence. In [59] they compared character- and word-based embeddings, with the latter yielding better results. Also, subword embeddings were reported to outperform the character ones by [67].

Unsurprisingly, we can find pre-trained subword multilingual embeddings, such as [84], which can be useful for MSA and for SA in low-resource languages. However, there are other alternatives to be explored, such as document-level embeddings [62], a combination of different levels [56,66], and even sentiment-driven embeddings [58], universal embeddings [68], or the use of tools such as LASER [72].

5.3. Future directions

Finally, we elaborate on the current state of research and provide a pathway for what can or needs to be done within the following few years.

Little-explored MSA levels. Hitherto, Aspect-Based Sentiment Analysis has not been widely addressed using multilingual deep learning approaches. As [66] suggests, tackling this problem may require more complex architectures. Moreover, it needs to be studied whether current proposals can handle mixed setups such as aspect-based code-switching.

MSA setup shift across time. Fig. 4 suggests a shift of interest from multilingual to cross-lingual and code-switching approaches. In MSA this can be explained since, initially, most of the works focused on the multilingual setup, evaluating many variations of the same design; researchers could perceive this path as depleted. Also, the adoption of transformer-based architectures such as BERT [82], and even of multilingual models such as Multilingual BERT (https://github.com/google-research/bert/blob/master/multilingual.md), allows researchers to focus on fine-tuning the models instead of training them from scratch with a multilingual corpus, as has been typical for the multilingual setup.

Multilingual representations. Multilingual embeddings and adversarial training seem to be the most common approaches to achieve multilingualism within the analyzed corpus. But there is no common standpoint about the level of embedding to use, or about how a single model can encode a multilingual or language-agnostic feature space useful for the downstream tasks. However, this debate seems to have shifted to the transformer-based architectures, where different tokenizers are being considered [85]. Moreover, despite the success in training transformers on multilingual corpora, recent studies suggest that there is a lot of room for improvement [86]. In this sense, we will probably see an increased number of works studying the impact of the differences between languages and language families.

Fig. 2. Languages vs Social Media.

Fig. 3. Neural Network architectures and their relations, out of 24 reviewed papers.

Fig. 4. MSA approaches, out of 24 papers, across four years.

SA-specific representations. For MSA, the aforementioned architectures may handle the specifics of this domain, such as the code-switching or the aspect-based setup. It will be necessary to study whether it is worth coupling techniques such as attention between different levels of representation, sentiment embeddings, or adversarial learning (e.g., GAN-BERT [87]).

Low-resource languages and dialects. Despite languages from different families having been studied (see Fig. 2), the coverage is far from complete. Moreover, the steady interest in sentiment analysis, the lack so far of a universal approach, and the new opportunities [88,89] would trigger the development of systems and corpora for SA in other languages. In this sense, we expect to see tailored solutions dealing with dialects and mixed languages. Besides India (native languages mixed with English), there are other large groups, such as (a) Mexico and the USA (Spanglish), (b) Brazil and its border countries, Portugal and Spain (Portuñol), and (c) Paraguay (Jopara, i.e., Guarani, an indigenous language, mixed with Spanish; and Portuñol), to mention a few cases. Nevertheless, the scarcity of available corpora is a challenge for tackling these code-mixing tasks. For instance, in [71] we could observe that, in an English-mixing MSA setup, BERT had been unable to outperform the traditional DL models. Thus, substantial progress still needs to be made.

6. Conclusions

In this work, we reviewed 24 studies covering 23 different languages and 11 different sources. The observed trend evidences the steady interest in this domain, so we expect to see this direction continue.

As regards the different MSA setups, the multilingual approach seems to be decreasing in interest. However, aspect-based sentiment analysis is still an understudied domain and an open research field with a lot of scope for future work.

We highlighted the main ideas authors proposed to tackle the challenge represented by the lack of annotated data, or to achieve language-independent models. Despite state-of-the-art results in some cases, the simpler backbone comprising embeddings, a feature extractor, and a classifier seems to be inappropriate for more complex scenarios. Also, there are unsolved questions, such as which type of embedding better captures the particulars of MSA. We hint at future research directions, for example, whether ideas such as contextualized embeddings, which have proven very useful in other tasks, can further improve MSA. Finally, although the studies have covered very different languages such as Arabic, Chinese, or Hindi, the world is extraordinarily rich in languages, cultures, and ways of expressing feelings. Thus, better approaches need to be assessed or developed for new scenarios.

CRediT authorship contribution statement

Marvin M. Agüero-Torales: Conceptualization, Formal analysis, Investigation, Data curation, Writing - original draft, Visualization. José I. Abreu Salas: Validation, Investigation, Writing - original draft, Writing - review & editing. Antonio G. López-Herrera: Conceptualization, Methodology, Writing - original draft, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Acknowledgments

This research work has been partially funded by the Generalitat Valenciana (Conselleria d'Educació, Investigació, Cultura i Esport) and the Spanish Government through the projects SIIA (PROMETEO/2018/089, PROMETEU/2018/089) and LIVING-LANG (RTI2018-094653-B-C22). We are also immensely grateful to David Vilares (Universidade da Coruña, Spain), whose recommendations provided a valuable orientation.

References
[1] S.L. Lo, E. Cambria, R. Chiong, D. Cornforth, Multilingual sentiment analysis: from formal to informal and scarce resource languages, Artif. Intell. Rev. 48 (4) (2017) 499–527.
[2] B. Pang, L. Lee, et al., Opinion mining and sentiment analysis, Found. Trends Inf. Retr. 2 (1–2) (2008) 1–135.
[3] S. Wang, C.D. Manning, Baselines and bigrams: Simple, good sentiment and topic classification, in: Proc. of the 50th Annual Meeting of the ACL: Short Papers, Vol. 2, ACL, 2012, pp. 90–94.
[4] B. Liu, Sentiment analysis and opinion mining, Synth. Lect. Human Lang. Technol. 5 (1) (2012) 1–167.
[5] B. Liu, Sentiment Analysis: Mining Opinions, Sentiments, and Emotions, Cambridge University Press, 2015.
[6] R. Socher, A. Perelygin, J. Wu, J. Chuang, C.D. Manning, A. Ng, C. Potts, Recursive deep models for semantic compositionality over a sentiment treebank, in: Proc. of the 2013 Conf. on Empirical Methods in NLP, 2013, pp. 1631–1642.
[7] C. Dos Santos, M. Gatti, Deep convolutional neural networks for sentiment analysis of short texts, in: Proc. of COLING 2014, the 25th Int. Conf. on Computational Linguistics: Technical Papers, 2014, pp. 69–78.
[8] Y. Kim, Convolutional neural networks for sentence classification, in: Proc. of the 2014 Conf. on Empirical Methods in NLP, EMNLP, ACL, Doha, Qatar, 2014, pp. 1746–1751.
[9] D. Tang, B. Qin, T. Liu, Deep learning for sentiment analysis: successful approaches and future challenges, Wiley Interdiscip. Rev.: Data Min. Knowl. Discov. 5 (6) (2015) 292–303.
[10] L.M. Rojas-Barahona, Deep learning for sentiment analysis, Lang. Linguist. Compass 10 (12) (2016) 701–719.
[11] P. Singhal, P. Bhattacharyya, Sentiment Analysis and Deep Learning: A Survey, Center for Indian Language Technology, Indian Institute of Technology, Bombay, 2016.
[12] Q.T. Ain, M. Ali, A. Riaz, A. Noureen, M. Kamran, B. Hayat, A. Rehman, Sentiment analysis using deep learning techniques: a review, Int. J. Adv. Comput. Sci. Appl. 8 (6) (2017) 424.
[13] D. Vilares, Compositional Language Processing for Multilingual Sentiment Analysis (Ph.D. thesis), Universidade da Coruña, 2017.
[14] L.J. Zhang, S. Wang, B. Liu, Deep learning for sentiment analysis: A survey, Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 8 (2018).
[15] O. Habimana, Y. Li, R. Li, X. Gu, G. Yu, Sentiment analysis using deep learning approaches: an overview, Sci. China Inf. Sci. 63 (1) (2019) 1–36.
[16] R. Wadawadagi, V. Pagi, Sentiment analysis with deep neural networks: comparative study and performance assessment, Artif. Intell. Rev. (2020).
[17] H. Nankani, H. Dutta, H. Shrivastava, P.R. Krishna, D. Mahata, R.R. Shah, Multilingual sentiment analysis, in: Deep Learning-Based Approaches for Sentiment Analysis, Springer, 2020, pp. 193–236.
[18] J.M. Wiebe, R.F. Bruce, T.P. O'Hara, Development and use of a gold-standard data set for subjectivity classifications, in: Proc. of the 37th Annual Meeting of the ACL, 1999, pp. 246–253.
[19] B. Pang, L. Lee, S. Vaithyanathan, Thumbs up?: sentiment classification using machine learning techniques, in: Proc. of the ACL-02 Conf. on Empirical Methods in NLP, Vol. 10, ACL, 2002, pp. 79–86.
[20] P.D. Turney, Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews, in: Proc. of the 40th Annual Meeting on ACL, ACL, 2002, pp. 417–424.
[21] R. Feldman, Techniques and applications for sentiment analysis, Commun. ACM 56 (4) (2013) 82–89.
[22] E. Cambria, B. Schuller, Y. Xia, C. Havasi, New avenues in opinion mining and sentiment analysis, IEEE Intell. Syst. 28 (2) (2013) 15–21.
[23] S. Rosenthal, N. Farra, P. Nakov, SemEval-2017 task 4: Sentiment analysis in Twitter, in: Proc. of the 11th Int. Workshop on Semantic Evaluation, SemEval '17, ACL, Vancouver, Canada, 2017.
[24] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Neural and Information Processing System, NIPS, 2013.
[25] J. Pennington, R. Socher, C.D. Manning, GloVe: Global vectors for word representation, in: Empirical Methods in NLP, EMNLP, 2014, pp. 1532–1543.
[26] M.E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: Proc. of NAACL-HLT, 2018, pp. 2227–2237.
[27] J. Yu, J. Jiang, Learning Sentence Embeddings with Auxiliary Tasks for Cross-Domain Sentiment Classification, ACL, 2016.
[28] H. Xu, B. Liu, L. Shu, P.S. Yu, Double embeddings and CNN-based sequence labeling for aspect extraction, in: Proc. of the 56th Annual Meeting of the ACL (Vol. 2: Short Papers), ACL, 2018, pp. 592–598.
[29] B. Huang, Y. Ou, K.M. Carley, Aspect level sentiment classification with attention-over-attention neural networks, in: Int. Conf. on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation, Springer, 2018, pp. 197–206.
[30] Y. Ma, H. Peng, E. Cambria, Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM, in: Thirty-Second AAAI Conf. on Artificial Intelligence, 2018.
[31] S. Ruder, P. Ghaffari, J.G. Breslin, A hierarchical model of reviews for aspect-based sentiment analysis, in: Proc. of the 2016 Conf. on Empirical Methods in NLP, 2016, pp. 999–1005.
[32] D. Tang, B. Qin, T. Liu, Document modeling with gated recurrent neural network for sentiment classification, in: Proc. of the 2015 Conf. on Empirical Methods in NLP, 2015, pp. 1422–1432.
[33] G. Rao, W. Huang, Z. Feng, Q. Cong, LSTM with sentence representations for document-level sentiment classification, Neurocomputing 308 (2018) 49–57.
[34] Y. Ganin, V. Lempitsky, Unsupervised domain adaptation by backpropagation, in: Proc. of the 32nd Int. Conf. on Int. Conf. on Machine Learning, Vol. 37, ICML'15, JMLR.org, 2015, pp. 1180–1189.
[35] Z. Li, Y. Zhang, Y. Wei, Y. Wu, Q. Yang, End-to-end adversarial memory network for cross-domain sentiment classification, in: IJCAI, 2017, pp. 2237–2243.
[36] B. Agarwal, R. Nayak, N. Mittal, S. Patnaik, Deep Learning-Based Approaches for Sentiment Analysis, Springer, 2020.
[37] J. Wehrmann, W. Becker, H.E.L. Cagnini, R.C. Barros, A character-based convolutional neural network for language-agnostic Twitter sentiment analysis, in: 2017 Int. Joint Conf. on Neural Networks, IJCNN, 2017, pp. 2384–2391.
[38] J.T. Zhou, S.J. Pan, I.W. Tsang, Y. Yan, Hybrid heterogeneous transfer learning through deep learning, in: Twenty-Eighth AAAI Conf. on Artificial Intelligence, 2014.
[39] G. Zhou, Z. Zeng, J.X. Huang, T. He, Transfer learning for cross-lingual sentiment classification with weakly shared deep neural networks, in: Proc. of the 39th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, ACM, 2016, pp. 245–254.
[40] P. Singhal, P. Bhattacharyya, Borrow a little from your rich cousin: Using embeddings and polarities of English words for multilingual sentiment classification, in: Proc. of COLING 2016, the 26th Int. Conf. on Computational Linguistics: Technical Papers, 2016, pp. 3053–3062.
[41] Z. Wang, Y. Zhang, S. Lee, S. Li, G. Zhou, A bilingual attention network for code-switched emotion prediction, in: Proc. of COLING 2016, The 26th Int. Conf. on Computational Linguistics: Technical Papers, 2016, pp. 1624–1634.
[42] S. Ghosh, S. Ghosh, D. Das, Sentiment identification in code-mixed social media text, 2017, arXiv:1707.01184.
[43] G.I. Ahmad, J. Singla, N. Nikita, Review on sentiment analysis of Indian languages with a special focus on code mixed Indian languages, in: 2019 Int. Conf. on Automation, Computational and Technology Management, ICACTM, IEEE, 2019, pp. 352–356.
[44] E. Tromp, Multilingual Sentiment Analysis on Social Media, Lap Lambert Academic Publ, 2012.
[45] C. Banea, R. Mihalcea, J. Wiebe, Multilingual sentiment and subjectivity analysis, in: Multilingual NLP, Vol. 6, Prentice-Hall New York, New York, 2011, pp. 1–19.
[46] I.S.V. Roncal, Multilingual Sentiment Analysis in Social Media (Ph.D. thesis), Universidad del País Vasco-Euskal Herriko Unibertsitatea, 2019.
[47] F. Steiner-Correa, M.I. Viedma-del Jesus, A. Lopez-Herrera, A survey of multilingual human-tagged short message datasets for sentiment analysis tasks, Soft Comput. 22 (24) (2018) 8227–8242.
[48] M.A. Abdullah, Deep Learning for Sentiment and Emotion Detection in Multilingual Contexts (ProQuest dissertations and theses, Ph.D. thesis), 2018, p. 103.
[49] B. Ay Karakuş, M. Talo, İ.R. Hallaç, G. Aydin, Evaluating deep learning models for sentiment classification, Concurr. Comput.: Pract. Exper. 30 (21) (2018) e4783.
[50] J. Deriu, A. Lucchi, V. De Luca, A. Severyn, S. Müller, M. Cieliebak, T. Hofmann, M. Jaggi, Leveraging large amounts of weakly supervised data for multi-language sentiment classification, in: Proc. of the 26th Int. Conf. on World Wide Web, 2017, pp. 1045–1052.
[51] W. Becker, J. Wehrmann, H.E. Cagnini, R.C. Barros, An efficient deep neural architecture for multilingual sentiment analysis in Twitter, in: The Thirtieth Int. Flairs Conf., 2017.


[52] I. Mozetič, M. Grčar, J. Smailović, Multilingual Twitter sentiment classification: The role of human annotators, PLoS One 11 (5) (2016) e0155036.
[53] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.
[54] S. Zhang, X. Zhang, J. Chan, Language-independent Twitter classification using character-based convolutional networks, in: Int. Conf. on Advanced Data Mining and Applications, Springer, 2017, pp. 413–425.
[55] L. Medrouk, A. Pappa, Deep learning model for sentiment analysis in multi-lingual corpus, in: Int. Conf. on Neural Information Processing, Springer, 2017, pp. 205–212.
[56] S. Zhang, X. Zhang, J. Chan, A word-character convolutional neural network for language-agnostic Twitter sentiment analysis, in: Proc. of the 22nd Australasian Document Computing Symposium, ACM, 2017, p. 12.
[57] N. Choudhary, R. Singh, I. Bindlish, M. Shrivastava, Sentiment analysis of code-mixed languages leveraging resource rich languages, in: 19th Int. Conf. on Computational Linguistics and Intelligent Text Processing, CICLing-2018, 2018.
[58] X. Dong, G. De Melo, Cross-lingual propagation for deep sentiment analysis, in: Thirty-Second AAAI Conf. on Artificial Intelligence, 2018.
[59] A. Konate, R. Du, Sentiment analysis of code-mixed Bambara-French social media text using deep learning techniques, Wuhan Univ. J. Nat. Sci. 23 (3) (2018) 237–243.
[60] M. Gopal Jhanwar, A. Das, An ensemble model for sentiment analysis of Hindi-English code-mixed data, 2018, arXiv preprint arXiv:1806.04450.
[61] L. Medrouk, A. Pappa, Do deep networks really need complex modules for multilingual sentiment polarity detection and domain classification? in: 2018 Int. Joint Conf. on Neural Networks, IJCNN, IEEE, 2018, pp. 1–6.
[62] K. Shalini, H.B. Ganesh, M.A. Kumar, K. Soman, Sentiment analysis for code-mixed Indian social media text with distributed representation, in: 2018 Int. Conf. on Advances in Computing, Communications and Informatics, ICACCI, IEEE, 2018, pp. 1126–1131.
[63] K. Stavridis, G. Koloniari, E. Keramopoulos, Deriving word embeddings using multilingual transfer learning for opinion mining, in: 2018 South-Eastern European Design Automation, Computer Engineering, Computer Networks and Society Media Conf., SEEDA_CECNSM, IEEE, 2018, pp. 1–6.
[64] W. Wang, S. Feng, W. Gao, D. Wang, Y. Zhang, Personalized microblog sentiment classification via adversarial cross-lingual multi-task learning, in: Proc. of the 2018 Conf. on Empirical Methods in NLP, 2018, pp. 338–348.
[65] X. Chen, Y. Sun, B. Athiwaratkun, C. Cardie, K. Weinberger, Adversarial deep averaging networks for cross-lingual sentiment classification, in: Proc. of the 2018 Conf. on Empirical Methods in NLP, 2018, pp. 557–570.
[66] G. Liu, X. Huang, X. Liu, A. Yang, A novel aspect-based sentiment analysis network model based on multilingual hierarchy in online social network, Comput. J. (2019).
[67] Y.K. Lal, V. Kumar, M. Dhar, M. Shrivastava, P. Koehn, De-mixing sentiment from code-mixed text, in: Proc. of the 57th Annual Meeting of the ACL: Student Research Workshop, 2019, pp. 371–377.
[68] M. Jabreel, N. Maaroof, A. Valls, A. Moreno, UniSent: Universal sentiment analysis system for low-resource languages, in: CCIA, 2019, pp. 387–396.
[69] S. Mukherjee, Deep learning technique for sentiment analysis of Hindi-English code-mixed text using late fusion of character and word features, in: 2019 IEEE 16th India Council Int. Conf., INDICON, IEEE, 2019, pp. 1–4.
[70] J. Shen, X. Liao, S. Lei, Cross-lingual sentiment analysis via AAE and BiGRU, in: 2020 Asia-Pacific Conf. on Image Processing, Electronics and Computers, IPEC, IEEE, 2020, pp. 237–241.
[71] A. Jamatia, S. Swamy, B. Gambäck, A. Das, S. Debbarma, Deep learning based sentiment analysis in a code-mixed English-Hindi and English-Bengali social media corpus, Int. J. Artif. Intell. Tools 29 (5) (2020).
[72] K. Kanclerz, P. Miłkowski, J. Kocoń, Cross-lingual deep neural transfer learning in sentiment analysis, Procedia Comput. Sci. 176 (2020) 128–137.
[73] R. Chundi, V.R. Hulipalled, J. Simha, SAEKCS: Sentiment analysis for English–Kannada code switch text using deep learning techniques, in: 2020 Int. Conf. on Smart Technologies in Computing, Electrical and Electronics, ICSTCEE, IEEE, 2020, pp. 327–331.
[74] R. Ghasemi, S.A. Ashrafi Asli, S. Momtazi, Deep Persian sentiment analysis: Cross-lingual training for low-resource languages, J. Inf. Sci. (2020) 1–14.
[75] M. Pontiki, D. Galanis, H. Papageorgiou, I. Androutsopoulos, S. Manandhar, A.-S. Mohammad, M. Al-Ayyoub, Y. Zhao, B. Qin, O. De Clercq, et al., Semeval-2016 task 5: Aspect based sentiment analysis, in: Proc. of the 10th Int. Workshop on Semantic Evaluation, SemEval-2016, 2016, pp. 19–30.
[76] M. Iyyer, V. Manjunatha, J. Boyd-Graber, H. Daumé III, Deep unordered composition rivals syntactic methods for text classification, in: Proc. of the 53rd Annual Meeting of the ACL and the 7th Int. Joint Conf. on NLP (Vol. 1: Long Papers), 2015, pp. 1681–1691.
[77] W.Y. Zou, R. Socher, D. Cer, C.D. Manning, Bilingual word embeddings for phrase-based machine translation, in: Proc. of the 2013 Conf. on Empirical Methods in NLP, 2013, pp. 1393–1398.
[78] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, B. Frey, Adversarial autoencoders, 2016, arXiv:1511.05644.
[79] G. Lample, A. Conneau, M. Ranzato, L. Denoyer, H. Jégou, Word translation without parallel data, in: Int. Conf. on Learning Representations, 2018.
[80] A. Joshi, A. Prabhu, M. Shrivastava, V. Varma, Towards sub-word level compositions for sentiment analysis of Hindi-English code mixed text, in: Proc. of COLING 2016, the 26th Int. Conf. on Computational Linguistics: Technical Papers, 2016, pp. 2482–2491.
[81] B.G. Patra, D. Das, A. Das, Sentiment analysis of code-mixed Indian languages: an overview of SAIL_Code-Mixed Shared Task@ ICON-2017, 2018, arXiv preprint arXiv:1803.06745.
[82] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018, arXiv preprint arXiv:1810.04805v2.
[83] Y. Zhu, X. Gao, W. Zhang, S. Liu, Y. Zhang, A bi-directional LSTM-CNN model with attention for aspect-level text classification, Future Internet 10 (12) (2018) 116.
[84] B. Heinzerling, M. Strube, BPEmb: Tokenization-free pre-trained subword embeddings in 275 languages, in: N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.), Proc. of the Eleventh Int. Conf. on Language Resources and Evaluation, LREC 2018, European Language Resources Association (ELRA), Miyazaki, Japan, 2018.
[85] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art NLP, in: Proc. of the 2020 Conf. on Empirical Methods in NLP: System Demonstrations, ACL, 2020, pp. 38–45, Online.
[86] T. Pires, E. Schlinger, D. Garrette, How multilingual is multilingual BERT? in: Proc. of the 57th Annual Meeting of the ACL, ACL, Florence, Italy, 2019, pp. 4996–5001.
[87] D. Croce, G. Castellucci, R. Basili, GAN-BERT: Generative adversarial learning for robust text classification with a bunch of labeled examples, in: Proc. of the 58th Annual Meeting of the ACL, ACL, 2020, pp. 2114–2119, Online.
[88] L. Yue, W. Chen, X. Li, W. Zuo, M. Yin, A survey of sentiment analysis in social media, Knowl. Inf. Syst. 60 (2) (2019) 617–663.
[89] A. Zunic, P. Corcoran, I. Spasic, Sentiment analysis in health and well-being: Systematic review, JMIR Med. Inform. 8 (1) (2020) e16023.

Marvin M. Agüero-Torales is a Ph.D. student of the Doctoral Programme in Information and Communication Technologies at the University of Granada (UGR), at the Department of Computer Science and Artificial Intelligence, and a Research Engineer at the UOC (Universitat Oberta de Catalunya, IN3). He is a former member of the Universidade da Coruña (UDC) and BSC-CNS. His research experience and interests include (i) natural language processing, (ii) text mining, and (iii) machine learning, with a particular focus on low-resource languages.

José I. Abreu Salas is a Researcher at the University Institute for Computing Research (IUII), Alicante. He has been a member of the Cuban chapter of the International Association of Pattern Recognition, and a Full-Time Professor at the University of Matanzas (UMCC) and the Catholic University of the Most Holy Conception (UCSC). His research interests cover (i) data-driven solutions in natural language processing and (ii) instance selection and prototype construction algorithms.

Antonio G. Lopez-Herrera is an Associate Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada (UGR). He holds a degree in Computer Engineering (2003, UGR), a Ph.D. in Computer Science (2006, UGR), a degree in Library and Information Science (2007, UGR) and a Master in Information and Scientific Communication (2008, UGR). His lines of research include information access, retrieval, filtering and evaluation, recommender systems, opinion mining, sentiment analysis, and bibliometrics. He teaches at the School of Computer (UGR) and at the Faculty of Communication and Documentation (UGR).
