
1 Introduction

In recent years, social media has provided great convenience to the public because it is more timely and easier to share and discuss information across various platforms. However, social media has also become an ideal place for fake news to propagate due to the lack of supervision mechanisms, and the widespread fake news on social media has tremendous negative impacts on individuals and the public. For instance, many commentators believe that the result of the 2016 US presidential election was affected by fake news [1].

To mitigate the negative impact of fake news, several fact-checking organizations devote enormous labor and time to identifying fake news among massive amounts of Internet content. Automatic detection of fake news is therefore a critical step in the fight against it. Early works on detecting fake news often designed comprehensive sets of hand-crafted features, which lack generalization; it is almost impossible to design all-encompassing features since fake news is written across many different types. Moreover, we do not yet have a comprehensive grasp of the linguistic characteristics of fake news. Methods based on text content alone therefore remain greatly limited.

On the other hand, news is often accompanied by related contextual information, such as the speaker of the news and their historical data. This contextual information provides useful signals beyond the textual content for detecting fake news. To learn high-level feature representations from contextual information, previous works mainly use deep learning methods to capture useful features automatically. Specifically, the contextual information is viewed as a sequence, and Convolutional Neural Network (CNN) based and Recurrent Neural Network (RNN) based methods are applied to capture the dependencies. However, these methods have their own shortcomings. For example, CNN-based methods can only learn local patterns and fail to capture a global feature representation, while RNN-based methods treat the input as an ordered sequence and update the hidden states step by step. This naturally raises two questions:

  • Will different sequence orders lead to substantially different results?

  • If so, how can we determine a sequence order that achieves better performance?

To explore the influence of the order of contextual information, we conduct a simple experiment on the LIAR dataset [2], which contains a wealth of contextual information. We use an LSTM to encode the contextual information; clearly, different sequence orders affect the learning procedure of the LSTM. We take the following sequences as examples:

  • S1: credit history of speaker, context, speaker name, subject of statement, job title of speaker, home state of speaker, party affiliation of speaker.

  • S2: context, credit history of speaker, speaker name, subject of statement, job title of speaker, home state of speaker, party affiliation of speaker.

  • S3: subject of statement, speaker name, job title of speaker, party affiliation of speaker, credit history of speaker, home state of speaker, context.

  • S4: credit history of speaker, context, party affiliation of speaker, speaker name, job title of speaker, home state of speaker, subject of statement.

Figure 1 shows the performance under the different orders. As Fig. 1 shows, the performance is indeed affected by the order: for example, the detection accuracy reaches 32.5% with input sequence S3 but only 25.7% with S4. Given that the number of possible orderings may be very large, it is neither wise nor efficient to try all combinations and choose an optimal sequence manually.

Fig. 1. Performance of the LSTM under different orders of contextual information.
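To make this probe concrete, the sketch below shows one way such an ordered encoding could be implemented in PyTorch (the library we use in Sect. 4). The field names follow LIAR, but the vocabulary sizes, layer dimensions, and class count are illustrative assumptions, not the exact configuration behind Fig. 1.

import torch
import torch.nn as nn

# Hypothetical probe: encode the contextual fields with an LSTM in a fixed order.
# Field names follow LIAR; vocabulary and layer sizes are illustrative assumptions.
FIELDS = ["credit_history", "context", "speaker", "subject",
          "job_title", "home_state", "party"]        # e.g. the order S1

class OrderedContextLSTM(nn.Module):
    def __init__(self, vocab_sizes, emb_dim=64, hid_dim=64, n_classes=6):
        super().__init__()
        # one embedding table per contextual field
        self.embs = nn.ModuleDict({f: nn.Embedding(v, emb_dim)
                                   for f, v in vocab_sizes.items()})
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.cls = nn.Linear(hid_dim, n_classes)

    def forward(self, batch, order=FIELDS):
        # stack the field embeddings in the given order -> (B, t, emb_dim)
        seq = torch.stack([self.embs[f](batch[f]) for f in order], dim=1)
        _, (h, _) = self.lstm(seq)           # hidden state is updated step by step
        return self.cls(h[-1])               # logits over the fakeness classes

# Changing the order argument (e.g. to S3 or S4) changes how the hidden state
# evolves and, as Fig. 1 shows, the resulting detection accuracy.
model = OrderedContextLSTM({f: 1000 for f in FIELDS})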

To address the above challenges, in this paper we propose a novel hybrid deep learning model that automatically learns features for fake news detection. Specifically, a Text-CNN [3] is applied to learn features from the textual content. Inspired by the Transformer, we use multi-head self-attention to capture the dependencies within the contextual information and to learn a global feature representation of it that assists in detecting fake news. The self-attention mechanism ignores the influence of the relative position of each element in the sequence and captures a global representation of the sequence. The contributions of this paper can be summarized as follows:

  • Through experiments, we confirm that the performance of fake news detection is affected by the order of the contextual information.

  • We propose a novel model, CMS, that ignores the influence of the sequence order and learns a global representation from contextual information for fake news detection.

  • Experiments on a real-world fake news dataset demonstrate the effectiveness of the proposed CMS model.

The rest of the paper is organized as follows: Sect. 2 briefly reviews related work, Sect. 3 introduces the details of the proposed method, Sect. 4 presents the experimental results, and Sect. 5 concludes this study.

2 Related Work

Fake news detection has attracted great concern in recent years due to its tremendous negative impact on the public. With the help of big data technology [4,5,6] and machine learning, automatic fake news detection has made great advances. Early works on detecting fake news mainly focused on designing comprehensive sets of hand-crafted linguistic features [7,8,9,10,11]. Unfortunately, we do not yet have a comprehensive grasp of the characteristics of fake news, and this procedure requires a large effort to explore the effectiveness of manual features.

Recently, deep learning based methods have been proposed to learn patterns for detecting fake news automatically. For example, Ma et al. [12] proposed a deep neural network based on RNNs to capture the temporal and textual features of rumour posts. Liu and Wu [13] utilized a CNN and a GRU to capture useful patterns from user profiles. Multi-modal fake news detection, which integrates textual and visual features, achieves certain improvements over textual features alone [14, 15].

Wang [2] presented a new publicly available dataset for fake news detection in 2017. It includes 12.8k manually labeled short statements, and each statement is accompanied by rich contextual information, i.e., speaker name, job title, political party affiliation, etc. Wang also proposed a Hybrid CNN to detect fake news on this dataset. Based on the LIAR dataset, several hybrid deep neural network methods were subsequently proposed to facilitate fake news detection. For instance, Long et al. [16] applied an LSTM and an attention mechanism to capture the patterns, and Karimi et al. [17] proposed the Multi-Source Multi-Class Fake News Detection (MMFD) model to discriminate different degrees of fakeness. However, the above methods ignore the influence of the sequence order of contextual information discussed in Sect. 1.

Different from the above methods, we propose a novel model, CMS, that ignores the sequence order and captures a global representation from contextual information for fake news detection.

3 Methodology

3.1 Problem Definition

Let \(X=\{x_{1}, x_{2}, ..., x_{N}\}\) be a set of news items, where each item is denoted as \(x= \{ s,c\}\), with s and c representing the statement and the contextual information of the news item, respectively. Each statement consists of several words and is denoted as \(s= \{ w_{1}, w_{2},..., w_{n} \}\), where n is the length of the statement. The contextual information is denoted as \(c=\{ c_{1},c_{2},...,c_{t} \}\), where t is the number of contextual features, such as speaker name, subject of statement, etc. \(Y=\{y_1,y_2,...,y_{N}\}\) denotes the corresponding label set. Our goal is to use the news set X and the label set Y to learn a multi-class classification model \(\mathcal {F}\) that can predict the degree of fakeness of unlabeled news.
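For concreteness, one possible in-memory representation of a news item \(x=\{s,c\}\) is sketched below; the contextual fields mirror those available in LIAR, and the example values are purely illustrative.

from dataclasses import dataclass
from typing import List

# Hypothetical container for one news item x = {s, c}; the example values
# are illustrative, not taken from the dataset.
@dataclass
class NewsItem:
    statement: List[str]   # s = {w_1, ..., w_n}: the tokenized statement
    context: List[str]     # c = {c_1, ..., c_t}: speaker, subject, party, ...
    label: int             # y: one of the M fakeness classes

example = NewsItem(
    statement=["the", "economy", "added", "jobs", "last", "month"],
    context=["economy", "john-doe", "senator", "some-state", "republican"],
    label=2,               # e.g. the index of "half-true"
)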

3.2 Model Overview

Figure 2 depicts the architecture of CMS. CMS consists of two main components: a module for extracting the linguistic features of the news statement and a module for capturing the contextual dependencies. Specifically, CMS is composed of three parts, as follows:

Fig. 2. The structure of CMS.

  • Linguistic Feature Extractor: We use a Text-CNN to extract a linguistic representation from the statement. The statement is mapped to a word embedding matrix through the word embedding layer, and the Text-CNN then learns a representation vector \(\varvec{s}\).

  • Context Encoding: To extract a global representation of the contextual information while ignoring the influence of its sequence order, we use a multi-head self-attention mechanism to encode the contextual information and output a high-level representation vector, denoted \(\varvec{c}\).

  • Integrate: The outputs of the two components above are concatenated to obtain the hidden representation of the news, denoted \(\varvec{d}\), which is fed into a fully connected layer for classification. A wiring sketch of these three parts follows this list.
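The sketch below illustrates, in PyTorch, one way these three parts could be wired together; TextCNN and ContextEncoder stand for the components detailed in the following subsections, and all dimensions are illustrative assumptions rather than the exact implementation.

import torch
import torch.nn as nn

class CMS(nn.Module):
    # Sketch of the overall wiring; the two encoders are detailed below.
    def __init__(self, text_cnn, context_encoder, text_dim, ctx_dim, n_classes):
        super().__init__()
        self.text_cnn = text_cnn                 # Linguistic Feature Extractor -> s
        self.context_encoder = context_encoder   # Context Encoding -> c
        self.fc = nn.Linear(text_dim + ctx_dim, n_classes)

    def forward(self, statement_emb, context_emb):
        s = self.text_cnn(statement_emb)         # (B, text_dim)
        c = self.context_encoder(context_emb)    # (B, ctx_dim)
        d = torch.cat([s, c], dim=-1)            # concatenated news representation
        return self.fc(d)                        # logits; softmax applied in the loss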

Linguistic Feature Extractor. We use a Text-CNN [3] to extract linguistic features from the statement. The statement is represented as a matrix of word embeddings \(\varvec{W}\in \mathbb {R}^{n\times e}\), where e is the dimension of the word embeddings. A Text-CNN typically applies kernels of multiple sizes, each with multiple filters, to extract features at different granularities. A filter \(\varvec{f}\in \mathbb {R}^{l\times e}\) is convolved with the word embedding matrix to obtain a feature map \(\varvec{P}\in \mathbb {R}^{n-l+1}\), where l is the length of the filter. Each \(p_{j}\in \varvec{P}\) is computed as follows:

$$\begin{aligned} p_{j}=\mathcal {C}(f\odot W_{j:j+l-1}) \end{aligned}$$
(1)

where \(W_{j:j+l-1}\) stands for the l consecutive words starting with the j-th word of the statement, and \(\mathcal {C}(\cdot )\) is a non-linear activation function, such as ReLU. We use max-pooling on each feature map to keep the most important information:

$$\begin{aligned} v = \text{ Max-Pooling }(P) \end{aligned}$$
(2)

This yields the feature corresponding to one particular filter. The complete feature representation of the statement is obtained by concatenating all pooled feature vectors:

$$\begin{aligned} \varvec{s}= [v_{1},v_{2},...,v_{k}] \end{aligned}$$
(3)

where k is the number of filters.
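A minimal PyTorch sketch of this extractor (Eqs. 1-3) is shown below, assuming the kernel sizes (2, 3, 4) and 50 filters per size reported in Sect. 4.2; details such as using Conv1d over transposed embeddings are choices of this sketch, not necessarily of the original implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    # Sketch of the Linguistic Feature Extractor (Eqs. 1-3).
    def __init__(self, emb_dim=300, kernel_sizes=(2, 3, 4), n_filters=50):
        super().__init__()
        # one Conv1d per kernel length l; each slides over the n - l + 1 windows
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, kernel_size=l) for l in kernel_sizes
        )

    def forward(self, W):                    # W: (B, n, e) word embedding matrix
        x = W.transpose(1, 2)                # (B, e, n), channel-first for Conv1d
        feats = []
        for conv in self.convs:
            p = F.relu(conv(x))              # Eq. 1: feature maps (B, n_filters, n-l+1)
            v = p.max(dim=-1).values         # Eq. 2: max-pooling over each map
            feats.append(v)
        return torch.cat(feats, dim=-1)      # Eq. 3: s, here of size 3 * 50 = 150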

Context Encoding. To capture a global representation of the contextual information regardless of the order of its features, we employ a variant of the Transformer [18] as the core of this module.

Usually, the credibility of news is context-dependent, and the interactions between the features of the contextual information are important for fake news detection. Self-attention is used to capture the informative interactions between these features. In our case, the self-attention process is based on the assumption that all features in the contextual information are closely related and unordered. An attention function maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. In practice, the queries, keys, and values are packed together into matrices \(\varvec{Q}\), \(\varvec{K}\), and \(\varvec{V}\), respectively. The contextual information is represented as an embedding matrix \(\varvec{C}\in \mathbb {R}^{t\times e}\), and \(\varvec{Q}\), \(\varvec{K}\), \(\varvec{V}\) are three different subspace projections of \(\varvec{C}\) obtained by linear transformations. The self-attention over the contextual information can thus be formulated as:

$$\begin{aligned} Attention \left( \varvec{Q},\varvec{K},\varvec{V} \right) = softmax\left( \frac{\varvec{Q}\varvec{K}^{T}}{\sqrt{d_{k}}}\right) \varvec{V} \end{aligned}$$
(4)

where \(d_{k}\) is the dimension of the vectors in \(\varvec{Q}\). A new representation of the contextual information is thus obtained through a single self-attention head. To allow the module to obtain higher-quality feature representations, multi-head self-attention is used to jointly draw feature interactions from different subspaces of \(\varvec{C}\):

$$\begin{aligned} MultiHead \left( \varvec{C} \right) =Concat \left( Attention_{1},...,Attention_{H} \right) \end{aligned}$$
(5)

where H is the number of heads. A fully connected layer follows, whose output size is equal to that of \(\varvec{C}\). This computation can be repeated K times, with the output of each layer serving as the input of the next. A max-pooling operation is then applied to the output of the last layer, denoted \(\varvec{C}'\in \mathbb {R}^{t\times e}\), to obtain a global representation of the contextual information:

$$\begin{aligned} \varvec{c}=\text{ Max-Pooling }(\varvec{r}_{1},\varvec{r}_{2},...,\varvec{r}_{e}) \end{aligned}$$
(6)

where \(\varvec{r}_{i}\) is the i-th column of \(\varvec{C}'\). The final representation of the contextual information is therefore \(\varvec{c}\in \mathbb {R}^{e}\).
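The following PyTorch sketch illustrates this encoder (Eqs. 4-6), using the built-in nn.MultiheadAttention for the attention step and the K = 3 layers with H = 2 heads reported in Sect. 4.2. No positional encoding is added, which is what makes the encoder order-agnostic; details such as normalization are omitted here and may differ from the actual variant.

import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    # Sketch of the order-agnostic Context Encoding module (Eqs. 4-6).
    def __init__(self, emb_dim=300, n_heads=2, n_layers=3):
        super().__init__()
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        # fully connected layer after each attention block, keeping the size of C
        self.ffn = nn.ModuleList(
            nn.Linear(emb_dim, emb_dim) for _ in range(n_layers)
        )

    def forward(self, C):                    # C: (B, t, e); no positional encoding,
        x = C                                # so the feature order does not matter
        for attn, ffn in zip(self.attn, self.ffn):
            out, _ = attn(x, x, x)           # Eqs. 4-5: multi-head self-attention
            x = torch.relu(ffn(out))         # fully connected layer, same size as C
        return x.max(dim=1).values           # Eq. 6: max over the t positions -> (B, e)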

Integrate. The final feature representation of the news, \(\varvec{d}\), is obtained by concatenating \(\varvec{s}\) and \(\varvec{c}\). It is then fed into a fully connected layer followed by a softmax activation to predict the label of the news:

$$\begin{aligned} \varvec{l} = softmax(\varvec{W}^{T}\varvec{d}+b) \end{aligned}$$
(7)

where \(\varvec{l}\in \mathbb {R}^{M}\) and M is the number of classes. The loss function is formulated as:

$$\begin{aligned} \mathcal {L}=-\frac{1}{N}\sum _{i=1}^{N}\varvec{y}_{i}\log {\varvec{l}_{i}}+\frac{\lambda }{2}\left\| \varTheta \right\| _{2}^{2} \end{aligned}$$
(8)

where \(\varvec{y}_{i}\) is the corresponding ground truth, \(\varTheta \) denotes the parameters of CMS, and \(\lambda \) is the trade-off coefficient of the \(L_{2}\) regularizer.
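In standard frameworks, Eq. 8 corresponds to a mean cross-entropy loss plus an \(L_{2}\) penalty on the parameters; a minimal PyTorch sketch of this objective is given below (in practice the \(L_{2}\) term can equivalently be supplied through the optimizer's weight decay).

import torch
import torch.nn as nn

def cms_loss(logits, targets, params, lam=1e-4):
    # Sketch of Eq. 8: mean cross-entropy plus an L2 penalty on the parameters.
    # CrossEntropyLoss applies the softmax of Eq. 7 internally.
    ce = nn.CrossEntropyLoss()(logits, targets)
    l2 = sum(p.pow(2).sum() for p in params)
    return ce + 0.5 * lam * l2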

4 Experiment

4.1 Dataset

To evaluate the performance of CMS, we conducted a series of comparative experiments on the real-world LIAR dataset published in [2]. LIAR contains a total of 12.8k manually labeled short statements, and each statement comes with rich contextual information. Table 1 shows the details of the LIAR dataset.

Table 1. Details of LIAR dataset

4.2 Experimental Setting

We use PyTorch to implement the proposed model. The word embeddings are initialized with 300-dimensional pre-trained word2vec embeddings [19]. We use a batch size of 64 and the Adam optimizer [20] with a learning rate of 0.0001; the \(L_2\) regularization coefficient is \(1e^{-4}\). In the Linguistic Feature Extractor, we use kernel sizes (2, 3, 4) with 50 filters per kernel size. In the Context Encoding component, the numbers of layers and heads are 3 and 2, respectively. We report the average accuracy over 10 trials as the performance metric.
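Putting the reported settings together, a hedged sketch of the training configuration is shown below; CMS, TextCNN, and ContextEncoder refer to the illustrative sketches from Sect. 3, a dummy batch stands in for the actual LIAR loader, and the \(L_2\) term of Eq. 8 is supplied here via the optimizer's weight decay.

import torch
import torch.nn as nn

# Configuration following Sect. 4.2; the model classes are the illustrative
# sketches from Sect. 3, not the authors' exact implementation.
model = CMS(
    text_cnn=TextCNN(emb_dim=300, kernel_sizes=(2, 3, 4), n_filters=50),
    context_encoder=ContextEncoder(emb_dim=300, n_heads=2, n_layers=3),
    text_dim=3 * 50, ctx_dim=300, n_classes=6,
)
# Adam with learning rate 1e-4; weight_decay supplies the L2 term of Eq. 8
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

# one dummy mini-batch standing in for the LIAR loader: batch size 64,
# statement length 20, t = 7 contextual fields, 300-dimensional embeddings
statement_emb = torch.randn(64, 20, 300)
context_emb = torch.randn(64, 7, 300)
labels = torch.randint(0, 6, (64,))

optimizer.zero_grad()
loss = criterion(model(statement_emb, context_emb), labels)
loss.backward()
optimizer.step()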

4.3 Performance Comparison

To demonstrate the effectiveness of CMS, we compare the following methods:

  • Random: Randomly selects the class of a test sample.

  • Majority: Assigns every test sample the label that accounts for the largest share of the dataset. In our experiment, we label each test sample as half-true.

  • Text-CNN: The Text-CNN model, which uses only the statement to classify news.

  • Hybrid-CNN: A Text-CNN learns textual features from the statements, and a CNN followed by a Bi-LSTM extracts features from the contextual information.

  • LSTM-Attention: Two LSTMs are applied to extract features, one with attention for the textual content and another for the contextual information.

  • MMFD: A CNN-LSTM extracts features from the textual content and the contextual information; the features are then combined with weights computed by an attention mechanism.

It is worth noting that among the above methods, Hybrid-CNN and LSTM-Attention try different combinations of contextual information and manually select the best-performing one, while our proposed model achieves this goal fully automatically. Table 2 shows the performance of all methods. As we can see, Text-CNN achieves the worst performance since it is based only on the textual content of the statement. Hybrid-CNN, LSTM-Attention, and MMFD achieve certain improvements over Text-CNN. Beyond the contextual information mentioned above, MMFD also uses the verdict reports written by experts on politifact.com to further improve the detection accuracy. Even so, our proposed model still improves the detection accuracy by 6.5% over MMFD. Compared with LSTM-Attention and Hybrid-CNN, CMS ignores the sequence order, captures the global features of the contextual information, and finally achieves 45.3% accuracy.

Table 2. Performance of detecting fake news.
Fig. 3. The performance of CMS and the compared methods under different sequence orders.

4.4 Effectiveness of CMS

To evaluate the effectiveness of CMS, we conduct a comparison experiment to illustrate its superiority. We take several sequence orders and apply the different methods to extract features from them. For fairness, all compared methods use Text-CNN as the textual feature extractor. Figure 3 shows the detailed results; the input sequences are the same as in Sect. 1. The results clearly show that the performance of Hybrid-CNN, LSTM-Attention, and MMFD is severely constrained by the sequence order. For example, all compared methods achieve their highest accuracy with sequence S3, reaching 32.5%, 40.0%, and 41.5%, respectively. This may indicate that S3 is a more appropriate order for extracting features from the contextual information. However, performance can degrade sharply under a less appropriate order; for example, LSTM-Attention achieves only 25.7% accuracy when the input sequence is S4.

Table 3. Difference of sequence order.

In contrast, CMS maintains good performance regardless of the sequence order. Table 3 shows the detailed results. As we can see, the other methods show a huge gap between the best and worst sequence orders, while for CMS the difference is only 0.5%. Hence, our proposed model not only achieves better performance but also remains stable in accuracy. This further illustrates that CMS can neglect the influence of the order and capture the global patterns effectively.

5 Conclusion

In this paper, we proposed a novel hybrid deep neural network model that automatically learns useful features from both textual content and contextual information. Inspired by the Transformer, we applied multi-head self-attention to extract features from the contextual information. The multi-head self-attention mechanism ignores the distance between elements and guarantees that every piece of contextual information is considered when capturing the dependencies. Experimental results further demonstrate the effectiveness of the CMS method.