
1 Introduction

Social media has become one of the most commonplace sources of information and is increasingly used as a prime source of news and current affairs. However, it is also considered a double-edged sword with regard to the near real-time, global spreading of information. Whilst it allows users to share and consume information in an easy and user-tailored manner, there are typically no checks on what information is sent. Indeed, it is possible to produce fake and harmful news, resulting in the spread of low-quality or incorrect information [1]. The rapid spreading of fake news can cause enormous harm to individuals, society and even countries. In this context, public sentiment has an increasing influence on society. Understanding public opinion is in demand across many areas, from the branding of products through to predicting events such as voting outcomes. The ability to accurately gauge public sentiment in near real-time is thus highly desirable, and sentiment analysis is one approach for establishing it. In this work, we consider the role of sentiment analysis and the way it may be used to distinguish real and fake news, to help unlock some of the challenges related to unchecked fake news distribution.

As one of the most popular mainstream social networks, attracting a huge number of users, Twitter has established itself as an important platform for the distribution of news, be that content fake or real. In this project, we utilise two Twitter news data sets: FakeNewsNet [5], reflecting real and fake news, and CredBank [18], reflecting official (true) resources. FakeNewsNet includes news categories related to politics and gossip, each of which provides Tweet IDs for real and fake news. CredBank comprises many topics and tweets that have been (manually) classified as real news. The total amount of data used was over 2.5 million tweets.

To tackle the classification of sentiment, this paper uses traditional Natural Language Processing (NLP) technology to process these corpora and applies two different machine learning methods, namely Naive Bayes and Decision Trees [19], to perform sentiment analysis and automatic discrimination/prediction of real or fake news as reported in social media networks. The sentiment analysis undertaken focuses on the sentiment tendency of Tweet texts, the word polarity in texts, and the sentiment characteristics of language features. To tackle the associated big data challenges, we also leverage deep learning approaches, including Bi-Directional Long Short-Term Memory (Bi-LSTM) [13], to train real and fake news classifiers and test whether real and fake news can be distinguished based on their sentiment and, importantly, on their associated content.

2 Related Work

Social media is an important medium for information dissemination, and the quality of the information it carries has been explored by many researchers through extensive studies based on social network data and data analysis. Capturing the sentiment of posts has also been extensively explored [20]. This has included non-English sentiment analysis [2] as well as combinations of geographical location, time factors and sentiment analysis, e.g. changing sentiment in different locations at different times in New York City [3].

In this work we focus on real and fake news in short messages (tweets). Buntain and Golbeck (2017) [4] studied the automated assessment of news accuracy on Twitter using the CredBank data set [18]. Shu et al. (2019) [5] provided a fake and real news data set (FakeNewsNet) collected from Twitter, including two comprehensive data sets related to politics and gossip. In their data set, each post contains news content, data sources and other information such as images and videos.

For sentiment analysis, many works focus on text pre-processing, feature extraction, and model training. For example, [6] used traditional natural language processing methods to establish a framework for sentiment classification with two key elements: a classifier and a feature extractor. In their work, classifiers were trained using machine learning models including Naive Bayes, Maximum Entropy, and Support Vector Machines (SVM). The feature extractor utilized N-Gram methods including unigrams and bigrams. Naradhipa et al. [2] used the Levenshtein distance to normalize the word format and divided the sentiment analysis into two types: sentence-level and word-level. As with many approaches, the sentiments were classified into positive, negative, and neutral. They used SVM with Maximum Entropy as their classification algorithm.

Buntain and Golbeck extracted 45 features from social media and divided these features into four categories: structural features, user features, content features and temporal features [4]. Shu et al. (2017) also extracted features related to different aspects of tweets, including news content-based features, language vocabulary-based features, visual-based features, social context-based features, and time sequence-based features. Tang et al. (2014) [7] proposed a new method for learning word embeddings suited to sentiment analysis, termed sentiment-specific word embedding. By taking the sentiment polarity of the text into consideration, this method addresses a shortcoming of traditional continuous word vector methods, which consider only the surrounding context and can therefore assign similar vectors to words with opposite sentiment polarity. They implemented this method and used it for training different neural networks.

Whilst various researchers have explored fake news and sentiment analysis, thus far no work has attempted to combine these two approaches. This is the contribution of this paper. Importantly, the work requires the development of a large-scale data processing platform capable of dealing with classification, feature extraction and sentiment analysis at scale.

3 System Architecture and Hypotheses

Social media data analytics, and especially the application of machine learning approaches, demands flexible and scalable infrastructure. In this work, the Amazon Web Services (AWS) Cloud platform was chosen as the infrastructure provider as it has a diverse range of stable services and features. Figure 1 illustrates the high-level architecture that was utilized in this work, including the Twitter social media harvesters, the storage systems and the data processing and analysis components. In this figure, the grey lines indicate the data flow through the system.

Fig. 1. System architecture.

In this work, the harvesters ran on the AWS Elastic Compute Cloud (EC2) platform with four processes running in parallel to harvest tweets through the Twitter Search API. Two processes were dedicated to harvesting tweets from CredBank and FakeNewsNet by matching the Tweet IDs provided in the dataset. The harvested tweets were stored locally on each instance where they could be retrieved for pre-processing. The pre-processed tweets together with the labels specified in the two datasets were then used by the analyser.
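For illustration, a minimal harvesting sketch is given below. It hydrates batches of Tweet IDs via the Twitter API v2 tweet lookup endpoint rather than the exact Search API calls used in the deployed harvesters, and the token and file names are purely illustrative.

```python
import json
import requests

BEARER_TOKEN = "<twitter-api-v2-bearer-token>"   # assumption: v2 credentials are available
LOOKUP_URL = "https://api.twitter.com/2/tweets"  # tweet lookup endpoint (max. 100 IDs per call)

def hydrate(tweet_ids):
    """Fetch the full tweet objects for a batch of up to 100 Tweet IDs."""
    params = {"ids": ",".join(tweet_ids), "tweet.fields": "lang,created_at"}
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    response = requests.get(LOOKUP_URL, params=params, headers=headers)
    response.raise_for_status()
    return response.json().get("data", [])

def harvest(id_file, out_file):
    """Read Tweet IDs from a dataset file and append the hydrated tweets to local storage."""
    with open(id_file) as f:
        ids = [line.strip() for line in f if line.strip()]
    with open(out_file, "a") as out:
        for i in range(0, len(ids), 100):
            for tweet in hydrate(ids[i:i + 100]):
                out.write(json.dumps(tweet) + "\n")

# e.g. one harvester process per dataset/category (file names are illustrative):
# harvest("politifact_fake_tweet_ids.txt", "politifact_fake_tweets.jsonl")
```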

Since the project only focused on Twitter (tweet) texts and their labels, the data that needed to be stored had a relatively simple structure. Therefore, the AWS Simple Storage Service (S3) was selected as the back-end storage system. Two S3 buckets were established: one as the primary storage, the other providing cross-region replication and acting as the backup. The pre-processed tweets were stored as five large JSON files in buckets based on their associated categories. The analyser was implemented using AWS SageMaker, a dedicated machine learning platform that provides an integrated Jupyter notebook instance [8]. The analyser retrieved data from the S3 buckets using the Python Boto library. It then performed the sentiment analysis and classification tasks, before storing the results.
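The following sketch illustrates how the analyser might read a per-category JSON file from the primary bucket and write results back. It assumes boto3 (the current Boto release) with credentials already configured; the bucket and key names are illustrative.

```python
import json
import boto3

s3 = boto3.client("s3")               # assumption: AWS credentials configured on the EC2/SageMaker instance
BUCKET = "fake-news-primary-storage"  # illustrative bucket name

def load_category(key):
    """Load one of the large per-category JSON files, e.g. 'politifact_fake.json'."""
    obj = s3.get_object(Bucket=BUCKET, Key=key)
    return json.loads(obj["Body"].read().decode("utf-8"))

def save_results(key, results):
    """Store analysis results back to the same bucket."""
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(results).encode("utf-8"))

# tweets = load_category("politifact_fake.json")
# save_results("results/sentiment_politifact_fake.json", {"positive": 0.4, "negative": 0.35, "neutral": 0.25})
```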

Allcott and Gentzkow identified that the diffusion pattern of false news goes significantly further, faster, deeper, and broader than that of more truthful (non-fake) news [9]. The reasons behind this can be multi-fold, e.g. an artefact of human nature or a penchant for conspiracy, and apply to many domains and contexts. In the finance domain, for example, finance-based fake news is often offered as “clickbait” for advertisements that are profit-driven and tasked with attracting website traffic [10]. Such content can trigger a stronger emotional response, which in turn acts as a trigger for further spreading of disinformation. Conspiracy-based fake news focuses on promoting particular ideas, advancing certain agendas, or discrediting others. Again, such information is intended to elicit strong emotional responses. Political and gossip-related fake news is a common type of conspiracy-based fake news. As a mainstream news and social networking service, Twitter is recognized as a major spreader of fake news. Tweet texts are generally not the sole source of fake news; for example, they may include links to websites containing the erroneous information. Through these links, users may be misled into spreading such tweets and hence disseminating the fake news. As such, the strength of emotion that fake news can trigger is important: fake news that no-one cares about is unlikely to be redistributed (retweeted). Based on this, we focus on three key considerations:

  • Establishing the sentiment of fake and real news as recorded in tweets;

  • Understanding the characteristics and textual features of tweets that can potentially be used to distinguish between real and fake news, and finally

  • Whether we can actually distinguish fake and real news tweets based on their content.

In exploring these considerations, we consider two underlying hypotheses regarding the sentiment of real and fake news:

  • Hypothesis 1. Fake news tweets are expected to include more emotive content, i.e. more positive or more negative content;

  • Hypothesis 2. Language features can be used to classify and hence distinguish real and fake news.

4 Methodology

To explore these hypotheses, we apply traditional machine learning methods and a neural network trained as a classifier, incorporating the sentiment analysis of fake and real news in social media. FakeNewsNet contains data sets from both the BuzzFeed and PolitiFact [21] platforms, covering two categories of news: politics and gossip. These data sets were provided as four CSV files: politifact_fake.csv, politifact_real.csv, gossipcop_fake.csv and gossipcop_real.csv. Each CSV file had four columns: id, uniquely identifying each news item; URL, the link to the original news posted on the network; title, the news article title; and tweet_ids, the unique IDs of the tweets used to share the news.
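A sketch of how these CSV files could be combined into one labelled table is shown below, assuming pandas and the column layout described above; the handling of multiple whitespace-separated tweet IDs per news item is an assumption about the released files.

```python
import pandas as pd

FILES = {
    ("politics", 1): "politifact_fake.csv",
    ("politics", 0): "politifact_real.csv",
    ("gossip",   1): "gossipcop_fake.csv",
    ("gossip",   0): "gossipcop_real.csv",
}

frames = []
for (category, label), path in FILES.items():
    df = pd.read_csv(path)
    df["category"] = category
    df["label"] = label                    # 0 = real news, 1 = fake news (labelling used in this section)
    frames.append(df[["id", "title", "tweet_ids", "category", "label"]])

news = pd.concat(frames, ignore_index=True)

# Each news item may reference several tweets; expand the tweet_ids column to one ID per row
# (assumption: IDs are whitespace-separated in the released CSVs).
news["tweet_ids"] = news["tweet_ids"].astype(str).str.split()
news = news.explode("tweet_ids").dropna(subset=["tweet_ids"])
```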

The CredBank data also includes a Credibility Annotation File with 1,377 topics, where each topic was rated by 30 independent external public assessors with scores ranging from −2 to +2, reflecting the credibility of the news from low to high, together with the corresponding reasons for the score [4]. Each topic comprised numerous tweet IDs. The average rating across assessors was used as the overall credibility of the tweets for each topic. In this work we selected 88 topics with average ratings above zero, i.e. tweets that could be considered real news. The total amount of real and fake news used is shown in Table 1.

Table 1. Data distribution

Having harvested the social media data, it was necessary to process the original data since it could contain redundant information. Removing replicas is a typical first step in data processing. Since the news collected from FakeNewsNet is not always in English, removing non-English content is also important; the Python library langid was used to determine the tweet language. All real news was labelled “0”, while fake news was labelled “1”. The processed data was stored in compressed JSON format, with each record comprising two key-value pairs: “text” for the news content and “label” for the news class. After pre-processing, the number of valid news articles was reduced, with the results shown in Table 2.

Table 2. Data distribution after pre-processing.
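A minimal pre-processing sketch along these lines, assuming the tweet texts have already been extracted, is shown below; the file names are illustrative.

```python
import gzip
import json
import langid

def preprocess(tweets, label):
    """Remove duplicates and non-English tweets, attaching the class label (0 = real, 1 = fake)."""
    seen, records = set(), []
    for text in tweets:
        if text in seen:                      # drop exact replicas
            continue
        seen.add(text)
        lang, _score = langid.classify(text)  # langid returns (language_code, score)
        if lang != "en":
            continue
        records.append({"text": text, "label": label})
    return records

def save_compressed(records, path):
    """Store the processed records as compressed JSON."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(records, f)

# save_compressed(preprocess(fake_political_texts, 1), "politifact_fake.json.gz")
```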

In order to more efficiently conduct sentiment analysis on fake and real news in social media, it is essential to remove unnecessary information, such as URL links, stop words, emojis and punctuation. Regular expressions were used for data cleaning, and the Python library spaCy was used for word tokenization and lemmatization. This library also facilitates the identification of relevant features.
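The cleaning step could be sketched as follows; the regular expressions and the choice of the small English spaCy pipeline are illustrative rather than the exact patterns used in this work.

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")         # assumption: the small English pipeline is installed

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_RE = re.compile(r"[^A-Za-z\s]")  # strips emojis, punctuation and digits

def clean(text):
    """Remove URLs and non-alphabetic characters, then tokenize, lemmatize and drop stop words."""
    text = URL_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    doc = nlp(text.lower())
    return [tok.lemma_ for tok in doc if not tok.is_stop and not tok.is_space and tok.lemma_.strip()]

# clean("Breaking: NASA confirms ... https://example.com")  # -> lemmas such as ['break', 'nasa', 'confirm']
```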

However, as can be seen from Table 3, the distribution of the news data is not balanced, which can cause errors when training the models. Therefore, sampling methods were utilised, including under-sampling and over-sampling. Under-sampling randomly samples a subset of the data for each kind of news; however, FakeNewsNet comprises significantly less data, so over-sampling was instead used to balance the disparity between the amounts of fake and real news. The Python library imblearn provides an algorithm (SMOTE) for addressing such unbalanced data issues. To test the methods, we collected news samples in initial experiments and identified that the classifiers performed better when trained on over-sampled data compared to under-sampled data. As a result, over-sampling was adopted.

Table 3. Performance of sampling methods.
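A sketch of the over-sampling step using imblearn's SMOTE is shown below. The TF-IDF vectorizer and its feature limit are assumptions, as the exact text features fed to SMOTE are not detailed here.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import TfidfVectorizer

# texts: list of cleaned tweet strings; labels: list of 0/1 class labels (0 = real, 1 = fake)
vectorizer = TfidfVectorizer(max_features=5000)     # feature limit is illustrative
X = vectorizer.fit_transform(texts)

print("before:", Counter(labels))
X_balanced, y_balanced = SMOTE(random_state=42).fit_resample(X, labels)
print("after: ", Counter(y_balanced))               # both classes now have equal counts
```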

Feature engineering, and especially feature selection, plays a very important role in machine learning. Feature selection is used to deal with high-dimensional data, e.g. where data samples are sparse or distance calculations (such as Levenshtein distances) become expensive. Removing unrelated features helps to speed up model training, reduce the size of the training vectors and improve overall model performance.

There are many algorithms for feature selection, such as removing features with low variance, univariate feature selection, and recursive feature elimination [22]. In this work, the F-value, belonging to the univariate feature selection methods, was used as part of an F-test to determine the features with the highest correlation between the features and the dependent variable. The Python library scikit-learn provides the function “f_classif” supporting this algorithm, which was applied here.
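Continuing the sketch above, the univariate selection of the 2,000 highest-scoring features could look as follows (scikit-learn 1.x is assumed for get_feature_names_out).

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Keep the k features whose ANOVA F-value against the label is highest.
selector = SelectKBest(score_func=f_classif, k=2000)
X_selected = selector.fit_transform(X_balanced, y_balanced)

# Map the selected column indices back to the vocabulary terms.
feature_names = vectorizer.get_feature_names_out()
top_features = feature_names[selector.get_support()]
```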

The number of features is also a critical factor affecting the performance of machine learning models. In this work, 10-fold cross-validation was used to evaluate model performance with different numbers of features. Figure 2 illustrates the performance of one model (a Decision Tree) with varying numbers of features; the model trained with 2,000 features performed best in this project.

Fig. 2. Model performance with varying numbers of features.

A Naive Bayes classifier was trained as the baseline, and the pre-processed data set was then utilised for training the decision tree classifier. Both classifiers were trained with the methods that the Python library scikit-learn provides. Given that different models perform differently on different news data sets, the training and test data sets were divided into three types: data sets with real and fake political news only, data sets with real and fake gossip news only, and data sets with real and fake news from the mixed news categories. During the training process, the classifiers were used to classify the data into two categories: fake news and real news. After model training was complete, the test data sets were evaluated using Accuracy, Precision and Recall as evaluation metrics to assess the performance of the trained models. Table 4 shows the results of the different approaches.

Table 4. Performance of different methods trained with different training datasets.
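For illustration, the two classifiers and the reported metrics could be produced along the following lines, continuing from the selected features above. MultinomialNB is an assumption, as the specific Naive Bayes variant is not stated, and the split ratio is illustrative.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y_balanced, test_size=0.2, random_state=42, stratify=y_balanced)

for name, model in [("Naive Bayes", MultinomialNB()),
                    ("Decision Tree", DecisionTreeClassifier(random_state=42))]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name,
          "accuracy:",  accuracy_score(y_test, pred),
          "precision:", precision_score(y_test, pred),   # fake news (label 1) as the positive class
          "recall:",    recall_score(y_test, pred))
```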

As can be seen from Table 4, the decision tree approach generally performed better than Naive Bayes. Part of a decision tree, illustrating the construction of the decision tree models, is shown in Fig. 3(a), where each node reflects one feature used in the tree. Thumbnails of the decision tree classifiers trained on the different training sets are shown in Fig. 3(b).

Fig. 3. Decision tree construction.

As can be seen, there are fewer branches in the politics decision tree, i.e. the decision tree can more readily classify real and fake political news. In contrast, the decision tree trained on the mixed data sets performed poorly, with many more branches.

In order to compare the prediction results between the different categories of news, confusion matrices were used as the evaluation method, since they directly reflect the true performance of the classifiers. Figure 4 shows the resulting confusion matrices, in which the second and fourth quadrants visualise the numbers of True Positive and True Negative objects respectively. As seen, it is more probable to predict fake news correctly for gossip news, whilst it is more probable to predict real news correctly for political news.

Fig. 4. Confusion matrix of different categories of news.
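For reference, such a confusion matrix can be produced directly from the predictions of the sketch above, e.g.:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Order the rows/columns as fake (1) first, then real (0).
cm = confusion_matrix(y_test, pred, labels=[1, 0])
ConfusionMatrixDisplay(cm, display_labels=["fake", "real"]).plot()
plt.show()
```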

As seen, these machine learning approaches generate large feature matrices, which makes them difficult to utilize at larger scale, i.e. in realistic fake news scenarios involving big data. To address this, a neural network was used as an alternative method.

Before input texts can be fed into a neural network for text classification, a conversion of text to a vector representation is needed, i.e. word embedding, since neural networks require well-defined fixed-length inputs. By embedding the words in documents, the information is encoded and the relations between words can be established. Typically, word embeddings are categorised into frequency-based and prediction-based approaches. Frequency-based embeddings such as TF-IDF, Bag-of-Words and Global Vectors for Word Representation (GloVe) [11] are built from word (co-)occurrence counts, while prediction-based approaches such as CBOW, Skip-Gram, and Word2Vec take the surrounding context into consideration. Instead of manually training vector representations, pre-trained models exist for off-the-shelf usage. These are time-saving, statistically robust, and leverage massive corpora. Among these models, two of the most popular are Word2Vec and GloVe. GloVe uses word-word co-occurrence probabilities and unsupervised learning [11]; it captures sub-linear relationships between words in the vector space and has been shown to perform well in word analogy tasks. In this work, GloVe was chosen for word embedding, providing the initial weights for the neural network classifier; deep learning approaches were then applied for the classification itself, as described below.
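A minimal sketch for loading pre-trained GloVe vectors into a lookup table, assuming the publicly released glove.6B 100-dimensional file, is shown below; it is reused as the initial embedding weights in the model sketched later in this section.

```python
import numpy as np

EMBEDDING_DIM = 100

def load_glove(path="glove.6B.100d.txt"):
    """Parse a pre-trained GloVe text file into a word -> vector dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings

glove = load_glove()
```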

LSTM is a variation of recurrent neural networks (RNN) [13]. An RNN takes both current and the recent past steps as inputs and combines them to determine how to respond to new data through a feedback loop. In this way, it behaves like memory. To find the correlation between input sequences through multiple steps or long-term dependencies, each input is stored in RNN’s hidden state and is cascaded forward to affect processing of the inputs in each step [12]. The aim of training an RNN model is to minimize the difference between the output and the correct label using given inputs by adjusting a weight matrix through an optimizer and loss functions.

Unlike traditional RNNs, where long-term contexts can be neglected due to the vanishing gradient problem during back-propagation, LSTMs are able to preserve the error and learn from remote contexts using multiple switch gates. Bidirectional LSTMs (Bi-LSTMs) train two LSTMs instead of one on an input sequence, with the first trained on the original sentence embeddings and the second on a reversed copy. This provides context to the left and right and results in faster and richer learning on language problems [13]. Bi-LSTMs also have relatively good performance for classifying sequential patterns, as often arise in texts [14].

In this work, the Python Keras library with TensorFlow as a backend was chosen as the implementation tool, with the input, hidden, and output layers configured as follows. In the input layer, each pre-processed tweet text was set as a single input. Tokenization was performed using Keras Tokenizers, with the resulting sequence embeddings padded to a length of 100 to provide uniformity. The padded embeddings were then fed into each neuron as data input. The initial weights of the model were set using the pre-trained GloVe embeddings, in which words are mapped to vectors based on their co-occurrence statistics in large corpora. The hidden layer consisted of a Bi-LSTM layer and two Dense layers with the parameters as listed in Table 5.

Table 5. Hidden layer parameters setting

The Rectified Linear Unit (ReLU) was selected as the activation function to transform the weighted sums into activations in the hidden layers, helping to overcome vanishing gradient problems. A binary cross-entropy loss function was chosen since the output class is binary; this also saves memory as it does not require the target variable to be one-hot encoded prior to training. The Adam optimizer was used to update the weights and minimize the loss function, since it offers faster convergence [16]. Dropout, used to prevent overfitting and improve generalization by randomly ignoring some outputs, was set to 0.5 for the Bi-LSTM layer; this is close to the optimal experimental value for a wide range of networks and tasks, as shown by [17]. In the output layer, a soft-max activation function was used in an output Dense layer to aggregate the results and give the final class probabilities.
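Putting the above together, a hedged sketch of the network in tf.keras (TensorFlow 2.x) follows. The layer sizes stand in for the Table 5 settings, which are not reproduced here, and a single sigmoid output unit with binary cross-entropy is used as the binary equivalent of the soft-max output layer described above. The variables texts, labels, glove and EMBEDDING_DIM are assumed from the earlier sketches.

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

MAX_LEN = 100                                     # padding length used in the input layer

# 1. Tokenize the cleaned tweet texts and pad to a uniform length.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_LEN)

# 2. Build the initial embedding weights from the pre-trained GloVe vectors.
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
for word, idx in tokenizer.word_index.items():
    if word in glove:
        embedding_matrix[idx] = glove[word]

# 3. Assemble the network; unit counts are illustrative stand-ins for the Table 5 settings.
model = Sequential([
    Embedding(vocab_size, EMBEDDING_DIM, weights=[embedding_matrix],
              input_length=MAX_LEN, trainable=False),
    Bidirectional(LSTM(64, dropout=0.5)),         # Bi-LSTM layer with dropout 0.5
    Dense(32, activation="relu"),                 # two Dense layers with ReLU activations
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),               # single-unit binary output (equivalent of the
])                                                # soft-max output layer described in the text)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# model.fit(sequences, np.array(labels), validation_split=0.1, epochs=5, batch_size=64)
```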

As described previously, the collected tweets relate to different topics. In order to analyse whether fake news could be detected purely based on content, the influence of the topic must be considered. To tackle this, the data sets were partitioned in two different ways. In the first partition, tweets from FakeNewsNet and CredBank were mixed and then divided to form the training and testing sets, testing the performance when the news topics are represented in the training set. The second partition uses only the FakeNewsNet data as the training set and the CredBank real news as the test set, testing the performance when the news topics are new to the classifier. The numbers of partitioned tweets are shown in Table 6. Note that the number of fake news tweets taken from CredBank is zero: the average ratings were used to judge whether news is real or fake, only a few topics had an average rating below zero, and adding those items would have made the data set seriously unbalanced.

Table 6. Training and testing dataset partition settings.

5 Results and Analysis

In order to analyze the sentiment polarity of the tweet texts and words, the Python libraries textblob and nltk.corpus were used.

To explore Hypothesis 1 (Fake news tweets are expected to include more positive or more negative content), the sentiment analysis on the FakeNewsNet data set involved classifying the tweets based on their sentiment, i.e. whether they expressed a positive, negative or neutral sentiment. The results are shown in Fig. 5.
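The polarity-based classification can be sketched with textblob as follows; the neutrality threshold is an assumption, as the exact cut-off used is not stated.

```python
from collections import Counter
from textblob import TextBlob

def sentiment_class(text, threshold=0.0):
    """Classify a tweet as positive, negative, or neutral using TextBlob polarity in [-1, 1]."""
    polarity = TextBlob(text).sentiment.polarity
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Example: sentiment distribution for one category of tweets.
counts = Counter(sentiment_class(t) for t in texts)
```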

Fig. 5. Sentiment analysis on tweet content (FakeNewsNet dataset).

From Fig. 5 it can be seen that, for all types of news tweets, emotional (positive or negative) content takes up a larger portion than neutral content, which suggests that Hypothesis 1 is not supported: emotional content dominates real as well as fake news tweets. However, investigating the relationship between positive and negative sentiment in more detail, it can be observed that the different types of news exhibit different patterns. The proportions of positive and negative sentiment in real and fake news tweets, and the comparison of these proportions between the two classes, are shown in Table 7. As seen in Table 7, for all three types of real news tweets, positive content takes up a larger portion than negative content, while among fake news tweets only political news has more negative than positive sentiment. This suggests that sentiment analysis could help to detect political fake news, as it presents a different pattern compared to real news. For all three types of news, positive sentiment takes up a larger portion of real news than of fake news, while the percentage of negative sentiment is larger in fake news. Thus, it can be considered that real news tweets tend more towards positive sentiment, while fake news tweets tend more towards negative sentiment. Note that these results may be biased by the number and balance of real and fake news tweets in the three categories.

Table 7. Sentiment portion in real and fake news.

To explore Hypothesis 2 (language features can be used to classify and hence distinguish real and fake news), the sentiment distributions of the top 2,000 features that best distinguish fake and real news tweets for gossip news, political news and the combination of both are illustrated in Fig. 6.

Fig. 6. Sentiment distribution of the top 2000 features (FakeNewsNet dataset).

It can be observed from Fig. 6 that the majority of features are neutral. Positive and negative features only make up a small proportion, and their numbers are (relatively) similar across the three categories, which suggests that Hypothesis 2 is not supported. To investigate the characteristics of good features, the 10 features with the highest contribution to the decision tree for real and fake news tweets of each category were plotted, as shown in Fig. 7.

Fig. 7. Top 10 features with best contributions (FakeNewsNet dataset).

As seen from Fig. 7, the features with the largest contributions are related to the news topics present in the FakeNewsNet dataset. The most typical features in gossip news are names such as ‘Brad’ and ‘Angelina’; for political news, words typically refer to organizations such as ‘NASA’ or ‘press’. Although there are correlations between these features and the news classes, their predictive power cannot be guaranteed, since new topics may appear in other scenarios and contexts. Another observation is that the contribution of political fake news features is relatively higher than for the other categories, which may imply that political fake news can more easily be detected. The Bi-LSTM classification results for the two differently partitioned data sets are shown in Table 8:

Table 8. Bi-LSTM classification results.

As can be seen from Table 8, mixing the training and testing sets yields much better performance than keeping them separated. This indicates that the predictive ability based on Twitter content is highly dependent on the topics present in the training set. In the real world, when an unrelated (new) topic is introduced, the classifier would be more likely to give a wrong prediction if only the news content were used.

It is noted that a limitation of this test is that the separated test set only includes real news, as only a few topics in the CredBank data set had an average rating that classified them as fake. However, as the difference between the separated and mixed partitions on real news classification is considerable, it can be deduced that the situation for fake news would not be much better.

6 Conclusions and Future Work

This work focused on the analysis of fake and real news in social media, using decision trees as the classifier to train and establish the contributions of features. A Bi-LSTM model was then used as a classifier with two differently partitioned training and testing sets to analyze whether fake news could be detected purely based on its content. It was discovered that for all types of news-related tweets, emotional content takes up a larger portion than neutral content, with real news tweets tending to have a more positive sentiment and fake news tweets a more negative sentiment. Of the three categories, political news was the only one that presented different sentiment patterns for fake and real news-related tweets. Although sentiment analysis could help to detect fake political news, the majority of the features that distinguish real from fake news were shown to be neutral. Typical patterns in the features with the best contributions included individual names and words related to organizations. However, as these features are all topic-related, the predictive ability based on Twitter content was identified as highly dependent on the topics seen before, which was reinforced by the results of the Bi-LSTM classifications.

The source code for this work is available at https://github.com/lixuand1/Fake-news-and-social-media/tree/master/code.