1 Introduction

According to Statista, there were 5.03 billion active internet users worldwide, about 63% of the global population, as of October 2022, and 93% of them were social media users [1]. Social media platforms, such as Twitter, Reddit, and Facebook, have become an integral part of our lives. We share almost everything on social media, from social and business events to personal opinions and emotions. Social media is also a popular and widely trusted platform for obtaining information in near real-time. People place strong trust in the information received from and shared by other users on social media. In other words, people can inform and influence one another via these platforms, which has significant social, political, and economic impacts on society.

Nowadays, almost all businesses use social media to engage directly with their consumers, to understand their needs, and to advertise their products or services. Consumers have complete control over what they want to see and how they want to respond. A single product review may affect consumer behavior and decision-making. As a result, a company's successes and failures are made public and spread quickly and widely through social media platforms. For example, a survey by Podium found that 93% of internet users are influenced by customer reviews in their purchase decisions [2]. A company that keeps pace with its customers' thinking can therefore respond promptly and devise a successful strategy to compete with its rivals.

Another impact of social media was observed during the COVID-19 pandemic, which emerged in December 2019 [3] and, as of October 2022, had infected more than 619 million people and caused more than 6.55 million deaths, with the numbers still increasing. The pandemic created great fear of infection and tremendous stress over daily life. According to the American Psychological Association, US adults recorded their highest stress levels since the early days of the pandemic, with 80% of it attributed to the prolonged stress caused by COVID-19 [4]. Social media has become one of the quickest ways for individuals to express themselves, flooding newsfeeds with information representing their thoughts. Undoubtedly, analyzing these newsfeeds is a direct way to capture their emotions and sentiment [5].

Sentiment analysis (also known as opinion mining) is the process of identifying, extracting, and classifying subjective information from unstructured text using text analysis and computational linguistics techniques in Natural Language Processing (NLP) [6]. It aims to determine the polarity of sentences using word clues extracted from the sentence's meaning [7, 8]. Sentiment analysis is therefore an essential technique for extracting valuable information from unstructured data sources, including tweets and reviews. It was first widely used to extract opinions from online sources, such as product reviews on the Web [9]. Since then, it has been applied in several other fields, such as stock market forecasting [10] and analyzing responses to terrorist attacks [11]. In addition, research at the intersection of sentiment analysis and natural language generation has discussed several concerns relating to the applicability of sentiment analysis, such as multilingual support [12] and irony detection [13]. Recently, sentiment analysis has played a crucial role in understanding people's feelings during the COVID-19 pandemic, helping governments understand people's concerns about COVID-19 and take appropriate measures accordingly [14].

Despite the potential benefits, automated analysis has limitations, including implementation complexity stemming from the ambiguity of natural language and the characteristics of the posted content. Tweets are an example, since they are often accompanied by hashtags, emoticons, and links, making it difficult to identify the expressed sentiment. Furthermore, automated techniques need large datasets of annotated posts or lexical databases of emotional terms associated with sentiment values. Unlike humans, machines have difficulty handling the subjectivity of a text, such as a sarcastic context [15]. People frequently use positive words to express negative feelings in sarcastic texts. This allows sarcasm to easily trick sentiment analysis models unless they are explicitly designed to take sarcasm into account. Ultimately, the variation in the terms used in sarcastic sentences makes it difficult to train a sentiment analysis model.

Because the misclassification of sarcastic texts can change the polarity of a sentence, the primary aim of this paper is to study the effect of sarcasm detection on sentiment analysis, improving existing sentiment analysis models for better accuracy and more intelligent information extraction. The contribution of this paper is twofold. First, we design a general framework that tackles sentiment analysis and sarcasm detection jointly for more intelligent and accurate information extraction. Second, we propose deep multi-task learning to simultaneously train the sentiment analysis and sarcasm detection models, reducing model complexity and increasing model efficiency.

This paper is divided into five sections. A comprehensive literature review of the related work is presented in Sect. 2. Section 3 provides a detailed explanation of the proposed multi-task learning framework for sentiment analysis and sarcasm detection. Performance evaluation is presented in Sect. 4. Section 5 gives the conclusions and possible future work.

2 Related Work

The roots of sentiment analysis on written documents can be traced back to World War II, when the focus was mainly political in nature. It became an active research focus in the mid-2000s, utilizing Natural Language Processing (NLP) to mine subjective information from various contents on the Internet. Different methods have been proposed to train sentiment analysis models, ranging from traditional machine learning algorithms to deep learning algorithms.

2.1 Sentiment Analysis Using Machine Learning

Up to the 1980s, most natural language processing algorithms were based on complex sets of hand-written rules. Since then, natural language processing has been revolutionized by the introduction of machine learning algorithms. Some early work classified sentiment into positive and negative categories, such as [7], in which three machine learning algorithms were used for sentiment classification: Support Vector Machine (SVM), the Naïve Bayes classifier, and Maximum Entropy. Classification was carried out using n-gram features: unigrams, bigrams, and their combination. The authors also used the bag-of-words (BoW) representation as input to the machine learning algorithms. Great potential was observed based on the promising performance shown in their studies.

The syntactic relations between words were used in [16] to analyze document-level sentiment. In that paper, sub-sequences of frequent terms and sub-trees of dependency parses were derived from sentences to serve as features for the SVM algorithm. Unigrams, bigrams, word sub-sequences, and dependency sub-trees were extracted from each sentence in the dataset. Another similar work used a mix of unsupervised and supervised techniques to learn word vectors and subsequently capture semantic document information and rich sentiment content [17].

In [18], a mechanism that embeds higher-order n-gram phrases into a low-dimensional latent semantic space was proposed to define a sentiment classification function. The authors used an SVM to construct a discriminative system that estimates the latent space parameters with a bias towards the classification task. This method can perform both binary classification and multi-score sentiment classification, which involves prediction within a set of sentiment scores.

A sentiment classification method using an entropy weighted genetic algorithm (EWGA) and SVM was proposed in [19]. Different feature sets consisting of syntactic and stylistic characteristics were evaluated; the stylistic features reflect word length distribution, vocabulary richness, and the frequency of special characters. Weights are allocated to different sentiment attributes before the genetic algorithm is used to optimize the sentiment classification. An SVM with ten-fold cross-validation was used to validate the model, and promising results were obtained.

2.2 Sentiment Analysis Using Deep Learning

In recent years, deep learning has received an overwhelming reception, as it does not require traditional, task-specific feature engineering, making it a powerful alternative for sentiment analysis. In [20], an architecture using a deep neural network to determine document similarity was proposed. The architecture was trained to generate vectors for articles using market news obtained from T&C. Cosine similarity was then calculated among the labeled articles considering the polarity of the documents while neglecting their contents. The proposed method achieved outstanding performance in estimating article similarity.

The authors in [21] suggested a sequence-modeling-based neural network for document-level sentiment analysis, focusing on customer reviews with a temporal nature. Their approach trained a recurrent neural network with gated recurrent units (RNN-GRU) to learn distributed product and user representations. These representations were then fed into a machine learning classifier for sentiment classification. The method was evaluated on three datasets obtained from Yelp and IMDb. Each review was tagged according to its rating score, and the network was trained with back-propagation using the Adam stochastic optimizer to minimize the loss function. Simulation results showed that sequence modeling of distributed product and user representations enhances document-level sentiment classification performance.

In [22], sentence-level sentiment analysis using a deep Recurrent Neural Network (RNN) composed of Long Short-Term Memory units (RNN-LSTM) was proposed, because sentiment analysis considering a unified feature set representing word embeddings, sentiment knowledge, sentiment shifter rules, and statistical and linguistic knowledge had not been previously studied. This combination provided sequential processing and overcame some of the flaws of conventional methods. A hybrid deep learning method called ConvNet-SVMBoVW was proposed in [23] to handle real-time data for fine-grained sentiment analysis. An aggregation model was created to calculate the hybrid polarity, and an SVM was trained on bag-of-visual-words (BoVW) features to predict the sentiment of visual content. The proposed method not only provided five levels of fine-grained sentiment (highly positive, positive, neutral, negative, and highly negative) but also outperformed existing methods.

A study [24] examined sentiments in datasets containing Arabic online reviews of cars and real estate. The authors used the Bi-LSTM (Bidirectional Long Short-Term Memory), LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit), CNN (Convolutional Neural Network), and CNN-GRU deep learning algorithms in combination with the BERT word embedding model. The real estate dataset contained about 6,434 opinions, whereas the automotive dataset included nearly 6,585 opinions. Records in both datasets were assigned one of three sentiment classes (negative, positive, and mixed). The BERT model with LSTM produced the highest F1-score of 98.71% for the automotive dataset, while BERT with CNN achieved the highest F1-score of 98.67% for the real estate dataset.

Recently, multi-task learning has gained substantial attention in deep learning research. Multi-task learning allows multiple tasks to be performed simultaneously by a shared model. The authors of [25] proposed a multi-task learning approach based on CNN and RNN. Their model jointly learns citation sentiment classification (CSC) and citation purpose classification (CPC) to boost the overall performance of automated citation analysis while easing the problems of inadequate training data and time-consuming feature engineering.

The authors of [26] proposed a method utilizing sarcasm detection to improve standalone sentiment classifiers, similar to our proposed method. Their method required two explicitly trained models: a sentiment model and a sarcasm model. Feature extraction for sarcasm detection used top word features, unigrams, and four Boaziz feature groups consisting of punctuation-related features, sentiment-related features, and lexical and syntactic features. Sarcasm detection used the Random Forest algorithm, while sentiment classification used the Naïve Bayes algorithm. The model achieved 80.4% accuracy, 91.3% recall, and 83.2% precision. The evaluation also showed that sarcasm detection improves the performance of sentiment analysis by about 5.49%.

This section introduced different techniques used by researchers for sentiment analysis, including n-gram, hybrid machine learning (MLT), and deep learning methods. Finding a suitable dataset and converting data to numerical vector form is a crucial step all researchers take to obtain better results. The accuracies obtained by these methods are high: the n-gram method achieved 94.6% using SVM [18], and the hybrid MLT method achieved 91.7% using a hybrid of EWGA and SVM [19]. Most of the deep learning methods outperformed the conventional ones. However, a shortcoming can be observed: as discussed in the previous section, sarcastic context plays a vital role in sentiment classification. If sarcasm is not considered by the system, a sarcastic text is classified as a positive tweet, leading to misclassification. An additional step is required to resolve this misclassification and obtain more accurate outcomes.

A significant milestone has been achieved in the field of sentiment analysis over the last decades in mining and understanding text-based documents for various purposes using machine learning and deep learning methods. However, existing techniques have limitations due to the inherent nature of language, such as the use of sarcasm to express feelings. This issue was tackled in [26] using two explicitly trained models, one for each task. That method provides more accurate sentiment analysis but comes at the price of higher complexity, longer processing time, and possible overfitting. In this paper, we propose a complete framework that enhances sentiment classification by detecting the presence of sarcasm in a sentence. To reduce complexity and processing time, multi-task learning is deployed in our framework.

3 Deep Multi-Task Learning Framework for Sentiment Analysis with Sarcasm Detection

Sarcasm is defined as the use of remarks that clearly carry the opposite meaning or sentiment, made to mock or annoy someone, or for humorous purposes. The presence of sarcasm in a document makes sentiment analysis less accurate, as conventional methods are not capable of detecting sarcasm. In this paper, we propose a framework for sentiment analysis that considers the possible presence of sarcasm. Specifically, the framework uses pre-processed data to train a Bidirectional LSTM (Bi-LSTM) network to perform two tasks simultaneously: sentiment classification and sarcasm classification. This approach, called multi-task learning, has been shown to improve learning efficiency and prediction accuracy for task-specific models. Each classifier in the proposed framework has a distinct perceptron layer, with different activation functions used depending on whether the classification task is binary or multi-class. The architecture of the proposed framework is given in Fig. 1, and each component is explained in detail in the remainder of this section.

Fig. 1 Sentiment analysis and sarcasm detection using deep multi-task learning: overall framework

3.1 Data Acquisition

Data acquisition is the first step in machine and deep learning and usually involves spending considerable time and resources gathering data that may or may not be relevant. Two datasets are needed in the proposed framework: a sentiment dataset and a sarcasm dataset. In this paper, both datasets are obtained from Kaggle. The sentiment analysis dataset [27], extracted from Twitter, contains 227,599 tweets labeled 0 for neutral, 1 for negative sentiment, and 2 for positive sentiment. For a better understanding of the dataset, some sample headers are shown in Fig. 2a, and the label distribution is given in Fig. 2b. It is important to note that the dataset is slightly unbalanced, with the neutral label being the majority and the negative label the minority. The sarcasm dataset [28] contains 28,619 tweets labeled 0 for not sarcastic and 1 for sarcastic. Figure 3a shows samples of sarcastic and not sarcastic tweet headers. This dataset is likewise slightly unbalanced, with more non-sarcastic tweets (label 0), as shown in Fig. 3b.
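As a minimal illustration (not part of the original pipeline), the two Kaggle datasets could be loaded and their label distributions inspected as follows; the file names and column names here are assumptions, as the paper does not specify them:

```python
import pandas as pd

# Hypothetical file and column names for the Kaggle datasets [27, 28].
sentiment_df = pd.read_csv("twitter_sentiment.csv")  # assumed columns: 'text', 'label' in {0, 1, 2}
sarcasm_df = pd.read_csv("twitter_sarcasm.csv")      # assumed columns: 'text', 'label' in {0, 1}

# Inspect the (slightly unbalanced) label distributions, cf. Figs. 2b and 3b.
print(sentiment_df["label"].value_counts())  # 0 = neutral, 1 = negative, 2 = positive
print(sarcasm_df["label"].value_counts())    # 0 = not sarcastic, 1 = sarcastic
```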

Fig. 2 a Sentiment header data; b amount of data for each label

Fig. 3 a Sarcasm header data; b amount of data for each label

3.2 Data Pre-Processing

Most datasets available online or collected manually are noisy and unstructured in nature, so pre-processing is needed to transform them into an understandable format for training and to ensure high accuracy. In this paper, the same pre-processing is used for both datasets, since both sentiment and sarcasm classification use the same natural language processing (NLP) techniques. The pre-processing steps used in our work are summarized as follows (a sketch of the pipeline is given after the list):

  a. Removal of irrelevant words, such as hyperlinks and noisy words (e.g., retweet markers, stock market tickers, and stop words).

  b. Removal of punctuation.

  c. Word stemming to reduce inflected words to their word stem or base.

  d. Tokenization to convert text into a vector representation as input to the deep learning model.
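A minimal sketch of steps (a)–(d), assuming NLTK's stop-word list and Porter stemmer and the Keras tokenizer (the paper does not name specific libraries):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from tensorflow.keras.preprocessing.text import Tokenizer

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)           # (a) hyperlinks
    text = re.sub(r"\brt\b|\$\w+", " ", text.lower())   # (a) retweet markers, stock tickers
    text = re.sub(r"[^a-z\s]", " ", text)               # (b) punctuation (and digits)
    tokens = [stemmer.stem(w) for w in text.split()
              if w not in STOP_WORDS]                   # (a) stop words, (c) stemming
    return " ".join(tokens)

tweets = ["So many assignments, I love school life so much! http://t.co/x"]
cleaned = [preprocess(t) for t in tweets]

# (d) tokenization: map each word to an integer index for the embedding layer
tokenizer = Tokenizer(num_words=20000)                  # vocabulary size is an assumption
tokenizer.fit_on_texts(cleaned)
sequences = tokenizer.texts_to_sequences(cleaned)
```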

3.3 Deep Multi-Task Learning Neural Network

Figure 4 shows the overall architecture of the multi-task learning deep neural network. In this stage, the pre-processed data are converted into numerical vectors before being used to train the Bi-LSTM network for both sentiment classification and sarcasm classification.

Fig. 4 Architecture of multi-task learning

3.3.1 Embedding Layer

Word embedding takes the pre-processed input and converts each word into a vector of numeric values. It is a typical method for representing words in text analysis, in which words with similar meanings are expected to lie closer together in the vector space. In this paper, the Word2Vec model is used to generate a word vector matrix for converting the inputs [29]. Word2Vec measures the degree of semantic similarity between word vectors using cosine similarity. Each text of \(n\) words, represented as \(T=\left[{w}_{1}, {w}_{2},\dots , {w}_{n}\right]\), is converted into a matrix of \(d\)-dimensional dense word vectors, as given in Eq. 1.

$$T=\left[{w}_{1}, {w}_{2},\dots , {w}_{n}\right]\in {\mathbb{R}}^{n\times d}$$
(1)

Each input text may have a different length, while the produced representation has a fixed length, denoted \(l\). With this in mind, zero padding is used if a text is shorter than \(l\), so that all texts have the same matrix dimension, resulting in the final text input of dimension \(l\times d\), as given in Eq. 2.

$$T=\left[{w}_{1}, {w}_{2},\dots , {w}_{l}\right]\in {\mathbb{R}}^{l\times d}$$
(2)
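A short sketch of this embedding-and-padding step, assuming the gensim implementation of Word2Vec and illustrative values for \(d\) and \(l\):

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus of pre-processed, tokenized tweets (stand-in for the real datasets).
corpus = [["love", "school", "life"], ["mani", "assign", "hate", "monday"]]

d = 100  # embedding dimension (assumed)
l = 20   # fixed input length l from Eq. 2 (assumed)

w2v = Word2Vec(sentences=corpus, vector_size=d, window=5, min_count=1)

def text_to_matrix(tokens):
    """Map a token list to an l x d matrix, zero-padding short texts (Eq. 2)."""
    mat = np.zeros((l, d))
    for i, w in enumerate(tokens[:l]):
        if w in w2v.wv:
            mat[i] = w2v.wv[w]
    return mat

T = text_to_matrix(corpus[0])  # shape (l, d)
```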

3.3.2 Multi-Task Learning

The purpose of the proposed framework is to enhance the performance of standalone sentiment classification by adding an auxiliary task, sarcasm detection. This could be achieved by training two explicit models sequentially: the sentiment model as the primary task and the sarcasm model as the secondary task. However, that approach is inefficient, redundant, and costly in terms of processing time and computation power. We therefore propose the use of multi-task learning in our framework to train the two tasks simultaneously with a shared Bi-LSTM layer. This method offers many advantages, including reduced overfitting through shared representations, reduced model complexity, improved data efficiency, and faster learning by leveraging auxiliary information. Multi-task learning also helps alleviate known weaknesses of deep learning methods, such as computational demand and large-scale data requirements [30].

As shown in Fig. 4, the two classification tasks share a single-layer Bi-LSTM recurrent neural network but have separate multilayer perceptron (MLP) layers. The sentences obtained from word embedding are fed into the Bi-LSTM network, and the output is then passed through a fully connected layer (FCL) to obtain the sentence representation \({Z}_{*}\). This operation can be represented by Eq. 3,

$${Z}_{*}=FCLaye{r}_{*}(Bidirection\left(LSTM\left(X\right)\right))$$
(3)

where * denotes the layer shared between the two tasks, and X is the list of input sentences obtained from the word embedding.

The respective classification is then obtained by applying the activation function to the sentence representation \({Z}_{*}\). In this framework, the activation functions used are softmax and sigmoid for sentiment classification and sarcasm classification, respectively, as given by Eqs. 4(a) and (b).

$${S}_{sentiment}=Softmax({Z}_{*})$$
(4a)
$${S}_{sarcasm}=Sigmoid({Z}_{*})$$
(4b)
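The shared-layer design of Fig. 4 and Eqs. 3–4b can be sketched in Keras as follows; the layer sizes and vocabulary parameters are assumptions for illustration (the actual hyperparameters used are listed in Table 1):

```python
from tensorflow.keras import layers, Model

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 20000, 100, 20  # assumed hyperparameters

inputs = layers.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)  # embedding layer (Sect. 3.3.1)
z = layers.Bidirectional(layers.LSTM(128))(x)        # shared Bi-LSTM layer
z = layers.Dense(64, activation="relu")(z)           # shared FC layer giving Z_* (Eq. 3)
z = layers.Dropout(0.5)(z)                           # dropout regularizer (Sect. 3.4)

# Task-specific perceptron layers (Eqs. 4a and 4b)
sentiment = layers.Dense(3, activation="softmax", name="sentiment")(z)  # 3 classes
sarcasm = layers.Dense(1, activation="sigmoid", name="sarcasm")(z)      # binary

model = Model(inputs=inputs, outputs=[sentiment, sarcasm])
```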

3.3.3 Loss Function

The configuration of the framework allows the secondary task to inform the training of the primary task by computing the loss of the model using Eq. 5:

$${L}_{i}= \sum_{\left(x,y\right)\in {\Omega }_{1}}{L}_{1}\left(x,y\right)+ \sum_{\left(x,y\right) \in {\Omega }_{2}}{L}_{2}\left(x,y\right)$$
(5)

where \({L}_{1}\) denotes the loss for the primary task, \({L}_{2}\) the loss for the secondary task, and \({L}_{i}\) the total loss of the proposed model. The total loss is computed over every sentence in the datasets \({\Omega }_{1}\) and \({\Omega }_{2}\).

The cross-entropy losses are given in Eqs. 6 and 7, respectively, for sentiment classification and sarcasm classification.

$$Categorical\, Cross\, Entropy=-\sum_{i=1}^{C}{t}_{i}\,\mathrm{log}\left({s}_{i}\right)$$
(6)
$$Binary\, Cross\, Entropy=-{t}_{1}\,\mathrm{log}\left({s}_{1}\right)-\left(1-{t}_{1}\right)\mathrm{log}\left(1-{s}_{1}\right)$$
(7)

where \({t}_{i}\) is the target label, \({s}_{i}\) is the predicted score (the softmax or sigmoid output), and \(C\) is the number of sentiment classes.

In order to optimize the performance of the model, the RMSprop optimizer is used in our framework. At every epoch, the algorithm computes the gradient of \({L}_{i}\) for each batch to fine-tune the parameters.
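Continuing the model sketch above, a hedged version of this training configuration (Eqs. 5–7 with RMSprop) could look as follows. It assumes every training sentence carries both labels; with the two disjoint datasets of Sect. 3.1, batches would in practice alternate between tasks with the unused output's loss masked:

```python
from tensorflow.keras.optimizers import RMSprop

# Keras sums the per-output losses by default, giving the total loss of Eq. 5.
model.compile(
    optimizer=RMSprop(learning_rate=1e-3),  # learning rate is an assumption
    loss={"sentiment": "sparse_categorical_crossentropy",  # Eq. 6, for integer labels
          "sarcasm": "binary_crossentropy"},               # Eq. 7
    metrics=["accuracy"],
)
# model.fit(X, {"sentiment": y_sent, "sarcasm": y_sarc}, epochs=10, batch_size=64)
```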

3.3.4 Bi-Directional Long Short-Term Memory (Bi-LSTM)

Deep recurrent neural networks (RNNs) have proven highly effective in sentiment analysis, as RNNs have feedback loops in the recurrent layer that allow information to be maintained as 'memory' over time. However, training conventional RNNs on problems that require learning long-term temporal dependencies, such as sentences or tweets, is difficult because the gradient of the loss function decays exponentially with time, a problem known as the vanishing gradient [31]. Hence, we use a Bi-LSTM to tackle this problem. A Bi-LSTM is a sequence processing model consisting of two LSTMs: one taking the input in the forward direction and the other in the backward direction. It provides more context to the algorithm, resulting in faster and more complete learning. In other words, it allows the network to preserve information from both the past and the future.

The basic architecture of the LSTM network is illustrated in Fig. 5. The LSTM network is a type of RNN that uses special units called memory cells, alongside standard units, to solve the vanishing gradient problem. An LSTM network accomplishes this by incorporating a cell state and three gates: the forget gate, the input gate, and the output gate. Using this explicit gating mechanism, the cell can decide what to do with the state vector at each step: read from it, write to it, or delete it. The input gate lets the cell decide whether to update the cell state, the forget gate lets it erase its memory, and the output gate determines whether the state information is made available at the output. Ultimately, the ability to 'memorize' and forget makes the LSTM a suitable method for the sentiment and sarcasm models, because every word in a sentence matters. Furthermore, it is vital to conserve information bi-directionally, from the past to the future and from the future to the past, when evaluating the nature of a sentence.
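For reference, the standard LSTM cell update behind Fig. 5 (a textbook formulation, not reproduced in the original) is:

$$\begin{aligned}
{f}_{t}&=\sigma \left({W}_{f}{x}_{t}+{U}_{f}{h}_{t-1}+{b}_{f}\right) &&\text{(forget gate)}\\
{i}_{t}&=\sigma \left({W}_{i}{x}_{t}+{U}_{i}{h}_{t-1}+{b}_{i}\right) &&\text{(input gate)}\\
{o}_{t}&=\sigma \left({W}_{o}{x}_{t}+{U}_{o}{h}_{t-1}+{b}_{o}\right) &&\text{(output gate)}\\
{c}_{t}&={f}_{t}\odot {c}_{t-1}+{i}_{t}\odot \mathrm{tanh}\left({W}_{c}{x}_{t}+{U}_{c}{h}_{t-1}+{b}_{c}\right)\\
{h}_{t}&={o}_{t}\odot \mathrm{tanh}\left({c}_{t}\right)
\end{aligned}$$

where \({x}_{t}\) is the input at step \(t\), \({h}_{t}\) the hidden state, \({c}_{t}\) the cell state, and \(\odot \) element-wise multiplication; the Bi-LSTM runs this recurrence in both directions and concatenates the two hidden states.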

Fig. 5 Architecture of LSTM

3.4 Multi-Layer Perceptron (MLP)

Figure 6 gives a deeper insight into the multilayer perceptron used in the proposed framework. The primary and secondary tasks have similar architectures except for the activation functions of the final dense layer. For both tasks, the ReLU activation function is applied first, setting all negative elements of \({Z}_{*}\) (the outputs of the bidirectional network) to 0, as shown in Eq. 8. This simplifies the backward-propagation computation, giving benefits such as fast training and preventing vanishing gradients.

Fig. 6 Architecture of the multilayer perceptron (MLP)

$$ReLU(x)=\mathrm{max}(0,x)$$
(8)

Then, dropout is added to the model as a regularizer: the algorithm randomly sets half of the activations in the fully connected layers to zero during training. This technique improves generalization and reduces overfitting [32]. Finally, training an LSTM-based multi-task model involves remembering broadly diverse sequential data. In this case, RMSprop is a suitable optimizer, as it normalizes the gradient using a moving average of squared gradients [33]. This balances the step size: increasing it for small gradients to avoid vanishing gradients and decreasing it for large gradients to avoid exploding gradients, thus ensuring efficient learning by the model [34].

3.5 Model Analysis

In this section, we provide a preliminary analysis of the proposed framework's performance. The model is trained using the hyperparameters defined in Table 1, and Fig. 7 gives the training and validation losses. The validation loss drops significantly from 1.4 to 0.5 between epochs 0 and 15, showing that the model is learning effectively. From epoch 15 onwards, the model stops learning as the validation loss reaches a steady state, although the training loss is still decreasing. For this reason, the number of epochs is set to 10 for training, since the validation loss equals the training loss at this point.
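As an aside, the plateau visible in Fig. 7 could also be detected automatically instead of hard-coding 10 epochs; a hedged Keras sketch (the patience value is an assumption):

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop once validation loss stops improving (it plateaus near 0.5 around
# epoch 15 in Fig. 7) and keep the best weights seen so far.
early_stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
# model.fit(..., epochs=50, validation_split=0.2, callbacks=[early_stop])
```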

Table 1 Hyperparameter settings

Fig. 7 Validation and training losses

In terms of complexity, the proposed model is slightly more complex than conventional deep neural networks due to bi-directional data propagation. Compared to similar classification work involving two tasks [26], the proposed model is much simpler thanks to multi-task learning. Figure 7 shows that the validation loss curve remains constant at around 0.5 instead of trending upwards, indicating that the proposed method successfully reduces overfitting.

In terms of training time, we compare the performance with the model presented in [26] (its architecture is given in Fig. 8), using the same hyperparameters as in our proposed method. That approach uses the same input and hidden layer settings to train the sentiment model and the sarcasm model separately, which is redundant: the model training has to be done twice. The multi-task learning method, in contrast, trains both tasks in each epoch and thus significantly lowers the training time.

Fig. 8 Architecture of the two explicitly trained sentiment and sarcasm models

In terms of computation power, the proposed method requires less computation because its neural network is smaller than that used in [26]. The goal is to shrink the neural network so that unnecessary processes and mathematical operations are removed: data pass through the shared hidden layer once instead of twice, significantly reducing computation.

4 Performance Evaluation

4.1 Experiment Settings

Figure 9 shows an overview of the experiments conducted to evaluate the performance of the proposed framework. Two categories of datasets are used. The first experiment compares various baselines and model variants used in similar works for a fair comparison. In the second experiment, we extracted data from multiple social media platforms, including Twitter and Reddit, to eliminate bias by feeding unseen data to the model.

Fig. 9 Overview of performance evaluation

Table 2 shows the relationship between the sentiment and sarcasm classifications: if the sentiment is positive and the tweet is sarcastic, the tweet is classified as negative (a small sketch of this rule follows Table 2). Consider, for example, the tweet "So many assignments, I love school life so much." The sentiment classifier will label this tweet positive, because "love school life so much" is highlighted by the classifier as a positive remark, while the sarcasm classifier will identify it as sarcastic due to the phrase "So many assignments." In this case, the tweet is negative.

Table 2 Sentiment and Sarcasm Relationship
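The full contents of Table 2 are not reproduced here, but the rule described in the text can be sketched as a post-processing step; the handling of non-positive sarcastic tweets is an assumption:

```python
def final_polarity(sentiment: str, sarcastic: bool) -> str:
    """Combine the two classifier outputs following Table 2:
    sarcasm flips a positive sentiment to negative."""
    if sentiment == "positive" and sarcastic:
        return "negative"
    return sentiment  # assumed: other combinations keep the raw sentiment

# The example from the text: positive sentiment + sarcastic -> negative
print(final_polarity("positive", True))  # "negative"
```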

4.2 Evaluation Metrics

The following standard metrics are used in this paper to benchmark the performance of the proposed framework.

  a. Recall measures the ability of a model to detect positive instances (also known as sensitivity). It is the ratio of true positives (TP) to the sum of true positives and false negatives (FN), as given in Eq. 9.

$$Recall= \frac{TP}{TP+FN}$$
(9)
  b. Precision is the accuracy of positive predictions, defined as the ratio of true positives to the total predicted positives, i.e., true positives (TP) plus false positives (FP), as given in Eq. 10.

$$Precision= \frac{TP}{TP+FP}$$
(10)
  c. F1-score is a helpful metric for comparing two classifiers. It considers both recall and precision and is defined in Eq. 11.

$$F1 score=2\times \frac{Precision\times Recall}{Precision+Recall}$$
(11)

Another commonly used metric, accuracy, is the ratio of correct predictions to total predictions. Accuracy does not account for class imbalance, which can make it highly misleading. Imagine a dataset of 1000 tweets, with 900 sarcastic and 100 not sarcastic. If our classifier predicts every tweet as sarcastic, we obtain a high accuracy of 90%, yet the F1-score for the not-sarcastic class is 0 because its recall is 0. Hence, our experiments use the F1-score as the primary metric to evaluate the proposed model's performance.
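This example can be reproduced with scikit-learn; the macro-averaged F1-score exposes the collapsed minority class that accuracy hides:

```python
from sklearn.metrics import accuracy_score, f1_score

# 1000 tweets: 900 sarcastic (1) and 100 not sarcastic (0);
# a degenerate classifier predicts everything as sarcastic.
y_true = [1] * 900 + [0] * 100
y_pred = [1] * 1000

print(accuracy_score(y_true, y_pred))             # 0.9 -- looks good
print(f1_score(y_true, y_pred, pos_label=0))      # 0.0 -- minority class collapses
print(f1_score(y_true, y_pred, average="macro"))  # ~0.47 -- pulled down accordingly
```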

4.3 Experiment 1: Model Baselines and Variants

The dataset used is split into 80% for training and 20% for testing. The following baselines and model variants are compared; a minimal builder sketch is given after the list.

  a. Standalone classifier using RNN with the following characteristics:

$${Z}_{sentiment}=FCLaye{r}_{sentiment}(RNN({X}_{sentiment}))$$
$${Z}_{sarcasm}=FCLaye{r}_{sarcasm}(RNN({X}_{sarcasm}))$$
$${S}_{sentiment}=Softmax({Z}_{sentiment})$$
$${S}_{sarcasm}=Sigmoid({Z}_{sarcasm})$$

In this case, there are two standalone models, one for sentiment and one for sarcasm. X is the list of input sentences obtained from the word embeddings. We feed X to the RNN and pass the output through a fully connected layer (FCLayer) to get the sentence representation Z. Finally, Z is passed through the dense layer, softmax for sentiment and sigmoid for sarcasm classification (\({S}_{sentiment}\) and \({S}_{sarcasm}\)).

  b. Standalone classifier using CNN with the following characteristics:

$${Z}_{sentiment}=FCLaye{r}_{sentiment}(CNN({X}_{sentiment}))$$
$${Z}_{sarcasm}=FCLaye{r}_{sarcasm}(CNN({X}_{sarcasm}))$$
$${S}_{sentiment}=Softmax({Z}_{sentiment})$$
$${S}_{sarcasm}=Sigmoid({Z}_{sarcasm})$$

This case is the same as case (a) except that a CNN is used instead of an RNN.

  c. Standalone classifier using Bi-GRU with the following characteristics:

$${Z}_{sentiment}=FCLaye{r}_{sentiment}(Bidirection\left(GRU\left({X}_{sentiment}\right)\right))$$
$${Z}_{sarcasm}=FCLaye{r}_{sarcasm}(Bidirection\left(GRU\left({X}_{sarcasm}\right)\right))$$
$${S}_{sentiment}=Softmax({Z}_{sentiment})$$
$${S}_{sarcasm}=Sigmoid({Z}_{sarcasm})$$

This case is the same as the two previous cases except that a Bi-GRU is used.

  d. Standalone classifier using Bi-LSTM with the following characteristics:

$${Z}_{sentiment}=FCLaye{r}_{sentiment}(Bidirection\left(LSTM\left({X}_{sentiment}\right)\right))$$
$${Z}_{sarcasm}=FCLaye{r}_{sarcasm}(Bidirection\left(LSTM\left({X}_{sarcasm}\right)\right))$$
$${S}_{sentiment}=Softmax({Z}_{sentiment})$$
$${S}_{sarcasm}=Sigmoid({Z}_{sarcasm})$$

This case also involves two standalone models, as in the previous cases, except that a Bi-LSTM is used.

  e. Multi-task learning using Bi-LSTM (the proposed method) with the following characteristics:

$${Z}_{*}=FCLaye{r}_{*}(Bidirection\left(LSTM\left(X\right)\right))$$
$${S}_{sentiment}=Softmax({Z}_{*})$$
$${S}_{sarcasm}=Sigmoid({Z}_{*})$$

where * denotes the layer shared between the two tasks.
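A hedged builder for the standalone baselines (a)–(d); the encoder choices and layer sizes here are assumptions for illustration:

```python
from tensorflow.keras import layers, Model

def standalone_classifier(encoder, n_out, activation):
    """One standalone baseline: Z = FCLayer(encoder(X)), S = activation(Z)."""
    inp = layers.Input(shape=(20,))         # fixed input length l (assumed)
    x = layers.Embedding(20000, 100)(inp)   # word embedding
    x = encoder(x)
    z = layers.Dense(64, activation="relu")(x)  # fully connected layer giving Z
    return Model(inp, layers.Dense(n_out, activation=activation)(z))

# Cases (a), (c), (d); for case (b), a Conv1D + GlobalMaxPooling1D encoder would be used.
sentiment_rnn = standalone_classifier(layers.SimpleRNN(128), 3, "softmax")
sentiment_bigru = standalone_classifier(layers.Bidirectional(layers.GRU(128)), 3, "softmax")
sarcasm_bilstm = standalone_classifier(layers.Bidirectional(layers.LSTM(128)), 1, "sigmoid")
```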

The experimental results are given in Fig. 10. First, when the four standalone models are compared, it is evident that the conventional neural networks offer poorer outcomes than the more recent Bi-LSTM network. The Bi-LSTM outperforms the RNN, CNN, and Bi-GRU by margins of 0.4%, 0.1%, and 0.1%, respectively, achieving an F1-score of 91% for sentiment classification and 92% for sarcasm classification. This improvement comes from the memory gates, which preserve important information while erasing useless information, and from bi-directional learning, which preserves information from both the past and the future for better classification.

Fig. 10 Experimental results using different model variants: a sentiment classification; b sarcasm classification

Next, we compare the standalone models with the proposed framework, which uses multi-task learning with a Bi-LSTM network. The proposed method outperforms all other models in sentiment and sarcasm classification, with F1-scores of 94% and 93%, respectively; compared to the standalone Bi-LSTM model, the improvements are 3% and 1%. This gain arises because multi-task learning utilizes a shared layer, reducing the risk of overfitting when computing gradient descent and ensuring efficient learning. It also means the sarcasm classifier successfully boosts the sentiment classifier's performance. The margin of improvement for the sentiment classifier is larger than for the sarcasm classifier because sarcasm detection is a subtask of sentiment analysis.

It is also important to note that the performance of the proposed model is consistent when we analyze the precision and recall values, which follow the same pattern as the F1-score, with improvements of about 3% and 1% for sentiment and sarcasm classification, respectively.

4.4 Experiment 2: Testing the Proposed Method Using Unbiased Datasets

In this experiment, we evaluate the performance of the proposed method using different datasets and compare it with the standalone Bi-LSTM model, the best standalone model from the previous experiment. This experiment aims to assess the performance of our model on unseen data and gauge how well it does in a real-life scenario. Three datasets obtained from Kaggle are used in this experiment, as summarized in Table 3 and explained as follows.

  a. Reddit dataset

Table 3 Distribution of Sentiment Sentences

The Reddit dataset was collected from the comment sections of Reddit posts. It is an unbalanced dataset, with positive sentiment being the majority. This dataset fits the purpose of this experiment well because it comes from a different social media platform, which may have a different writing structure.

  b. Twitter US Airline dataset [35]

This dataset consists of reviews of US airlines written by customers on Twitter. Contributors were asked to classify each comment as positive, negative, or neutral, followed by the reasons for their classification. It is also an unbalanced dataset, with negative sentiment forming the majority of the data.

  c. Twitter dataset

This dataset was scraped from Twitter using the Tweepy module. It is an unbalanced dataset, with negative sentiment being the majority of the data. This dataset is the closest in nature to the dataset used to train the models.

Figure 11 shows the experimental results obtained through testing using different datasets. The F1-score is given in Fig. 11a. We can see that the Twitter dataset has an F1-score of 58% for the standalone sentiment classifier and 63% for the proposed method, which is the highest among the three datasets. We believe the reason for this better performance is the nature of the dataset, which is the same as the data used for training but of broader scope. The Reddit dataset obtained an F1-score of 55% for the standalone sentiment classifier and 61% for the proposed method, while the Twitter US airline dataset obtained an F1-score of 52% for the standalone sentiment classifier and 54% for the proposed method.

Fig. 11 Experimental results using unbiased datasets: a F1-score; b neutral score

Generally, the performance of the proposed model on unbiased data is relatively low, averaging about 59%. To better understand this, we examine the neutral-class scores of the proposed method on these datasets, shown in Fig. 11b. It is essential to highlight that poor performance on one label significantly affects the overall F1-score of the model: the neutral label yields deficient F1-scores of 48%, 51%, and 44% for the Reddit, Twitter, and Twitter US Airline datasets, respectively. These low F1-scores are mainly due to low recall in classifying neutral sentiment (33%, 36%, and 29% for the respective datasets), which suggests that many neutral sentences are classified as false negatives. Figure 12 shows the word cloud for the neutral label in these datasets. Most of these neutral words are stop words removed in the pre-processing stage; consequently, the model has difficulty classifying neutral tweets. Nevertheless, the proposed method still outperforms the standalone model in analyzing these unbiased datasets.

Fig. 12 Word cloud for the neutral label in the datasets

5 Conclusions and Future Work

Sentiment analysis plays a vital role in the current digital world, providing useful information from the Internet, especially social media, for various purposes. In this paper, we tackled the problem of inaccurate sentiment analysis caused by the use of sarcasm in some sentences. More specifically, a multi-task learning framework for simultaneous sentiment analysis and sarcasm detection was proposed, in which sarcasm detection identifies the sarcastic context in a sentence and improves the accuracy of sentiment analysis. Two experiments using different datasets were conducted to evaluate the performance of the proposed framework, and the results show promising improvements. We conclude that sarcastic tweets do affect the performance of sentiment analysis models, and adding an efficient sarcasm detection model can significantly improve overall performance.

The performance of the proposed method relies on the dataset used for training. As demonstrated in the second experiment presented in the previous section, the performance of the proposed framework is poorer on unseen datasets because neutral sentiments return poor recall scores, which in turn lowers the F1-scores. This happens because stop words, which are neutral in nature, are removed during the pre-processing stage, weakening the neutral class. To overcome this problem, a more appropriate pre-processing technique is needed that removes stop words without diminishing the neutral statements in the datasets.