Multimodal Matching-Aware Co-Attention Networks With Mutual Knowledge Distillation For Fake News Detection
Abstract—Fake news often involves multimedia information such as text and image to mislead readers, proliferating and expanding its influence. Most existing fake news detection methods apply the co-attention mechanism to fuse multimodal features while ignoring the consistency of image and text in co-attention. In this paper, we propose multimodal matching-aware co-attention networks with mutual knowledge distillation for improving fake news detection. Specifically, we design an image-text matching-aware co-attention mechanism which captures the alignment of image and text for better multimodal fusion. The image-text matching representation can be obtained via a vision-language pre-trained model. Additionally, based on the designed image-text matching-aware co-attention mechanism, we propose to build two co-attention networks respectively centered on text and image for mutual knowledge distillation to improve fake news detection. Extensive experiments on three benchmark datasets demonstrate that our proposed model achieves state-of-the-art performance on multimodal fake news detection.

Index Terms—Image-text Matching, Mutual Knowledge Distillation, Fake News Detection

I. INTRODUCTION

Online social media has become an indispensable platform for people to share and access information in their daily life. Due to the loose constraints for user-generated content on social media, there could be plenty of fake news that distorts and fabricates facts, which could mislead readers and even cause a great negative impact on society. Fake news usually uses multimedia information such as text and image to draw users' attention and expand its influence. It has become urgent and important to detect multimodal fake news on social media.

Many efforts have been made on fusing textual and visual features for multimodal fake news detection. As shown in Figure 1(a), existing studies conduct multimodal fusion via simple concatenation, auxiliary tasks, or the co-attention mechanism. Early studies [1], [2], [3] combine the two modality features by simple concatenation. Some studies improve them by introducing auxiliary tasks such as feature reconstruction [4] and event discrimination [5] to enhance multimodal feature learning. In order to further capture the inter-modality correlations, recent studies adopt co-attention networks for fine-grained modality interactions to detect fake news [6], [7], [8]. However, they fail to consider the matching degree of the image and text in the co-attention mechanism for multimodal fusion. Fake news could have mismatched images and text, in which case directly applying co-attention between the different modalities could lead to under-performed multimodal features for fake news detection.

To address this limitation, we propose a novel image-text matching (ITM) aware co-attention mechanism to capture the matching degree of image and text while learning multimodal fusion features. Although some existing works also pay attention to the consistency of different modality content in the news by combining consistency-based additional features with multimodal features for fake news detection [9], [10], [11], they do not consider the image-text alignment in learning multimodal fusion features, which plays a critical role in fake news detection. Additionally, we propose to conduct mutual learning between two ITM-aware co-attention networks that are respectively centered on text (text features as query) and image (image features as query), which enables them to learn collaboratively with mutual knowledge distillation for improving fake news detection.

Overall, in this paper, we propose novel Multimodal Matching-aware Co-Attention Networks with mutual knowledge distillation (MMCAN, shown in Figure 1(b)) for improving multimodal fake news detection. Specifically, we first obtain the ITM representation via a vision-language pre-trained model such as ViLT [12]. Then we design a new ITM-aware co-attention mechanism, which can learn better multimodal fusion features relying on the alignment of image and text in the news. Based on the proposed ITM-aware co-attention, we build two co-attention networks respectively centered on text (text features as query) and image (image features as query) for multimodal fake news detection. Moreover, mutual learning between the two co-attention networks is exploited to enable knowledge distillation from each other for collaboratively improving fake news detection. Extensive experiments on three benchmark datasets demonstrate that our model achieves state-of-the-art performance in multimodal fake news detection.

In summary, our main contributions are as follows:
• We propose novel multimodal matching-aware co-attention networks with mutual knowledge distillation for improving fake news detection.
• We design a new ITM-aware co-attention mechanism for learning better multimodal fusion features, guided by the alignment of image and text in the news. Moreover, mutual knowledge distillation of two co-attention networks based on the new co-attention mechanism is also employed to further improve fake news detection.
• Extensive experiments demonstrate that our model MMCAN achieves state-of-the-art performance on three public benchmark datasets for the multimodal fake news detection task.
Fig. 1: Comparison between (a) existing multimodal fake news detection methods, which fuse text and image features via simple concatenation, auxiliary tasks, or co-attention, and (b) the proposed MMCAN, which exploits image-text alignment information in ITM-aware text-centered and vision-centered co-attention networks.
II. RELATED WORK

A. Fake News Detection

Fake news is news that is intentionally fabricated and can be verified as fake [13], [14]. Existing studies on fake news detection can be divided into unimodal methods and multimodal methods.

1) Unimodal Content Based Methods: The unimodal fake news detection methods can be further divided into two types: textual feature based and visual feature based methods. In early studies, the textual features are mostly hand-crafted [15], [16], and it is difficult to fully mine the deep semantic information conveyed by the text. To address this problem, many studies use deep learning technologies to learn the textual representation of news towards identifying fake news [17], [18], [19], [20]. For instance, Liao et al. [21] proposed a graph based method for learning news representations which capture news relations.

Visual features have also been recognized as an important indicator for fake news detection. Jin et al. [22] extracted several visual features to characterize image distribution patterns for detecting fake news. To avoid feature engineering, Qi et al. [23] utilized a visual neural network to effectively capture and fuse the characteristics of fake-news images at both the physical and semantic levels.

2) Multimodal Content Based Methods: Multimodal fusion features have been shown to play a critical role in fake news detection. Early studies [1], [2], [3] mainly focus on designing more advanced feature extractors for multiple modalities, and then the multimodal fusion is fulfilled simply by the concatenation operation. Some studies utilize auxiliary tasks such as feature reconstruction [4] and event discrimination [5] to enhance the multimodal feature learning for fake news detection. To model the interactions between the two modalities sufficiently, co-attention based methods have been proposed for multimodal fake news detection [6], [24], [8]. Concretely, Qian et al. [24] considered the hierarchical semantics of text and utilized multiple co-attention layers to perform multimodal interactions. Zheng et al. [8] proposed to fuse textual, visual, and social modal features via the co-attention mechanism. Some works also pay attention to the consistency between the text and image, considering that news pieces with mismatched text and image are more likely to be fake than those with matching text and image [9], [10], [25], [11]. For example, Chen et al. [11] estimated the cross-modal ambiguity with the Kullback-Leibler (KL) divergence between the unimodal feature distributions and used the ambiguity score to govern the aggregation of unimodal features and multimodal features for fake news detection.
Zhou et al. [9] measured the cosine similarity between text and image features for fake news detection, in addition to multimodal feature based fake news detection.

Different from the above works, to learn better multimodal fusion features, we design a new ITM-aware co-attention mechanism capturing the image-text alignment. Moreover, we exploit the mutual learning of two ITM-aware co-attention networks for improving fake news detection.

B. Deep Mutual Learning

Model distillation is an effective and widely used technique to transfer knowledge from a teacher network to a student one. Nevertheless, in practice, there could be no teacher but only students. Towards this, Zhang et al. [26] proposed deep mutual learning, which aims to distill knowledge between peer students by pushing them to learn collaboratively and teach each other. Since then, mutual learning has attracted many researchers' attention [27], [28], [29], [30]. For example, Wei et al. [28] introduced a distillation loss between textual and visual networks to learn modality correlations for facilitating fake news detection. Zhang et al. [30] performed mutual learning between RGB and IR modality branches to improve the intra-modality discrimination in the cross-modal person re-identification task. Differently, in this work, based on our ITM-aware co-attention mechanism, we build two co-attention networks respectively centered on text and image, and apply mutual learning between them to collaboratively improve fake news detection.

III. METHODOLOGY

We propose novel multimodal matching-aware co-attention networks with mutual knowledge distillation (MMCAN) to improve the performance of the fake news detection task. In MMCAN, a new multimodal matching-aware co-attention mechanism is designed to capture the matching information of image and text for learning multimodal representations. Based on the new co-attention mechanism, two co-attention networks centered on different modalities are employed to enable mutual knowledge distillation for improving fake news detection.

As shown in Figure 2, the proposed model has four major modules: text and image encoder, multimodal matching-aware co-attention networks, fake news classifiers, and mutual learning. Given news with text and image, we first utilize two different sub-models to extract features from text and image. Then the multi-modality features are fused through two multimodal matching-aware co-attention networks respectively centered on text (text features as query) and image (image features as query). After that, the output of the co-attention networks is used for fake news classification. Mutual learning is applied between the two co-attention networks to enable mutual knowledge distillation for collaboratively improving fake news detection.

A. Text and Image Encoder

We learn the representations of text and image with a Transformer encoder [31], which is able to capture the intra-modality interactions.

Text Encoder. In order to accurately model the semantic information of the text T in a piece of news and avoid word ambiguity, we employ the pre-trained BERT [32] to obtain word embeddings. Specifically, the text T is tokenized into a sequence of m word tokens and we can obtain the embeddings $E^T \in \mathbb{R}^{m \times d_t}$ from the last BERT encoding layer, where $d_t$ is the dimension of the word embedding. To capture the intra-modality interactions among words, we adopt a standard Transformer encoder layer [31] composed of a multi-head self-attention and a Feed Forward Network (FFN) to learn the text representation $H^T$ as follows:

$H^T = \mathrm{Transformer}(E^T + E^T_{pos})$,  (1)

where $H^T \in \mathbb{R}^{m \times d_t}$, and $E^T_{pos}$ denotes the parameter-free positional embedding.

Image Encoder. We also employ the Transformer, which has proved to be effective in many visual understanding tasks [33], [34], [35], to extract the visual features of the news. As the standard Transformer takes a 1D sequence of token embeddings as input, we first reshape the image V into a sequence of flattened 2D patches. Then we use a trainable linear projection to flatten the patches as the input for the pre-trained ViT. In particular, we use the ViT-B/16 [36] pre-trained on ImageNet to get the patch embeddings $E^V \in \mathbb{R}^{n \times d_v}$, where n is the number of patches and $d_v$ is the dimension of the patch embeddings. Analogously, a Transformer encoder layer is adopted for internal interactions of the visual modality. Formally, we can get the visual representation $H^V$ as follows:

$H^V = \mathrm{Transformer}(E^V + E^V_{pos})$,  (2)

where $H^V \in \mathbb{R}^{n \times d_v}$, and $E^V_{pos}$ is the positional embedding [31].
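To make the two encoders concrete, the following PyTorch-style sketch shows one possible realization of Eqs. (1) and (2) over pre-extracted BERT word embeddings and ViT patch embeddings; the module name `ModalityEncoder`, the learnable positional table, and the default of 8 heads are our assumptions rather than details taken from the paper's released code.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Sketch of Eqs. (1)/(2): H = Transformer(E + E_pos) for one modality."""
    def __init__(self, seq_len: int, dim: int, num_heads: int = 8):
        super().__init__()
        # Positional embedding added to the BERT word embeddings (text) or
        # ViT-B/16 patch embeddings (image). The paper describes a
        # parameter-free positional embedding; a learnable table is used here
        # purely for brevity.
        self.pos = nn.Parameter(torch.zeros(1, seq_len, dim))
        # One standard Transformer encoder layer: multi-head self-attention + FFN.
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                                batch_first=True)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (batch, seq_len, dim), e.g. E^T of shape (batch, m, d_t)
        return self.layer(emb + self.pos)

# Usage sketch: H_T = ModalityEncoder(m, d_t)(E_T); H_V = ModalityEncoder(n, d_v)(E_V)
```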
B. ITM-aware Co-attention Networks

In this subsection, we describe our ITM-aware co-attention networks for learning multimodal fusion features. Specifically, the co-attention networks employ the newly designed ITM-aware co-attention mechanism, which can better model the inter-modality interactions guided by the alignment of image and text. As shown in Figure 2, two ITM-aware co-attention networks respectively focused on textual and visual information are constructed for multimodal fusion representation learning. Taking the ITM-aware co-attention network focused on image as an example, the detail of the co-attention network is illustrated in the right part of Figure 2. It consists of an ITM-aware co-attention unit for inter-modality interactions and a self-attention unit for further information interaction to detect fake news.

ITM Representation. To capture the alignment of image and text in our co-attention mechanism for learning better multimodal fusion features, we first leverage a vision-language pre-trained model to obtain the image-text matching representation.
Fig. 2: The architecture of our MMCAN model. The right part illustrates the ITM-aware co-attention network focused on image.
In this work, we use the pre-trained Vision and Language Transformer (ViLT), which takes the ITM task as one of its pre-training tasks. Formally, given the original text T and the attached image V, we obtain the ITM representation based on the ITM head of the pre-trained ViLT:

$H^M = \mathrm{ITM\text{-}head}(\mathrm{ViLT}(T, V))$.  (3)

It implies the alignment of image and text, where the ITM-head is a single linear layer projecting the pooled features to logits over the binary (matched/mismatched) classes.
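As a rough illustration of Eq. (3), the sketch below wraps a frozen ViLT-style model behind a simple interface; `vilt` is assumed to return pooled multimodal features, and the ITM head is shown as a plain linear layer over two classes (in the paper this head comes pre-trained with ViLT rather than being freshly initialized).

```python
import torch
import torch.nn as nn

class ITMRepresentation(nn.Module):
    """Sketch of Eq. (3): H_M = ITM-head(ViLT(T, V))."""
    def __init__(self, vilt: nn.Module, pooled_dim: int):
        super().__init__()
        self.vilt = vilt.eval()                 # ViLT parameters stay frozen
        for p in self.vilt.parameters():
            p.requires_grad = False
        # Single linear layer projecting pooled features to matched/mismatched logits.
        self.itm_head = nn.Linear(pooled_dim, 2)

    def forward(self, text_inputs, image_inputs) -> torch.Tensor:
        with torch.no_grad():
            pooled = self.vilt(text_inputs, image_inputs)   # (batch, pooled_dim)
        return self.itm_head(pooled)                        # H^M: (batch, 2)
```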
ITM-aware Co-attention Unit. In order to learn better multimodal fusion features guided by the alignment of image and text, we design a new image-text matching-aware co-attention mechanism for cross-modality interaction. Taking the vision-centered co-attention network as an example, as shown in Figure 3, the queries are from the visual features $H^V$, and the keys and values are from the textual features $H^T$. They are passed as inputs to the multi-head co-attention to model interactions between modalities. Formally, the inputs of the i-th head of co-attention are transformed as follows:

$Q_i, K_i, V_i = H^V W_{Q_i}, H^T W_{K_i}, H^T W_{V_i}$,  (4)

where $W_{Q_i} \in \mathbb{R}^{d_v \times d_v/h}$ (h denotes the number of heads) is the projection matrix for queries, and $W_{K_i}, W_{V_i} \in \mathbb{R}^{d_t \times d_t/h}$ are the projection matrices for keys and values, respectively.

Afterwards, we calculate the multi-head attention vector based on the query, key, and value matrices to capture cross-modality correlations. To guide the learning of multimodal fusion features with the image-text alignment, we employ a gating function on the ITM representation $H^M$ to obtain a soft weight distribution $\alpha^M$, which is used to adjust the multi-head attention vector via element-wise multiplication. Formally, the output features $H^C$ of the ITM-aware co-attention mechanism can be calculated as follows:

$\mathrm{Att}_i(H^V, H^T) = \mathrm{softmax}\Big(\frac{Q_i K_i^T}{\sqrt{d_t/M}}\Big) V_i$,
$\mathrm{MH\text{-}Att}(H^V, H^T) = [\mathrm{Att}_1 \oplus \cdots \oplus \mathrm{Att}_M] W'$,
$\alpha^M = \sigma(H^M W^M + b^M)$,
$H^C = \alpha^M \odot \mathrm{MH\text{-}Att}(H^V, H^T)$,  (5)

where $\mathrm{Att}_i$ refers to the i-th head of multi-head co-attention, $W' \in \mathbb{R}^{d_v \times d_v}$ and $W^M \in \mathbb{R}^{d_m \times n}$ are the weight matrices, $b^M$ is the bias term, $\oplus$ denotes the concatenation operation, and $\odot$ denotes element-wise multiplication.

In order to address the degradation problem and normalize the distributions of intermediate layers, we wrap a residual connection and LayerNorm (LN) around the matching-aware co-attention. Formally, we can obtain the output of the ITM-aware co-attention unit:

$\widetilde{O}^V = \mathrm{LN}(H^V + H^C)$.  (6)

Self-Attention Unit. After the ITM-aware co-attention, self-attention is applied to enable further interactions between modalities for the fake news classifier. Finally, we can obtain the vision-centered multimodal fusion features $O^V$:

$H_S^V = \mathrm{LN}(\widetilde{O}^V + \mathrm{MH\text{-}Att}(\widetilde{O}^V, \widetilde{O}^V))$,
$O^V = \mathrm{LN}(H_S^V + \mathrm{FFN}(H_S^V))$.  (7)

Analogously, the text-centered ITM-aware co-attention network takes the text representation $H^T$ as queries and the image representation $H^V$ as keys and values, and outputs the multimodal fusion features $O^T$ focused on text.
Fig. 3: The architecture of the ITM-aware co-attention unit.
C. Fake News Classifier

For the obtained multimodal fusion features, we exploit a fully connected layer followed by a softmax function to predict the authenticity of news:

$P^{T(V)} = \mathrm{softmax}(W^{T(V)} O^{T(V)} + b^{T(V)})$,  (8)

where $P^{T(V)}$ denotes the predicted probabilities based on the text(vision)-centered multimodal fusion features $O^{T(V)}$. We employ cross entropy to calculate the classification loss:

$\mathcal{L}_C^{T(V)} = -\sum_{i=1}^{N}\big[y_i \log(P_i^{T(V)}) + (1 - y_i)\log(1 - P_i^{T(V)})\big]$,  (9)

where N is the number of news, and $y_i$ and $P_i^{T(V)}$ respectively denote the ground-truth label and the predicted probability of the i-th news based on the text(vision)-centered co-attention network.

D. Mutual Learning

The mutual learning loss $\mathcal{L}_{KL}^{T(V)\rightarrow V(T)}$ for regularizing the vision (text) centered co-attention network to imitate the text (vision) centered co-attention network can be denoted as:

$\mathcal{L}_{KL}^{T(V)\rightarrow V(T)} = D_{KL}(P^{V(T)} \,\|\, P^{T(V)})$.  (11)

The final objective function for MMCAN becomes:

$\mathcal{L} = \mathcal{L}_C^{T} + \mathcal{L}_C^{V} + \lambda_{KL}(\mathcal{L}_{KL}^{T\rightarrow V} + \mathcal{L}_{KL}^{V\rightarrow T})$,  (12)

where $\lambda_{KL}$ is used to balance the classification and mutual learning losses. For the inference stage, we average the prediction probabilities of the text- and vision-centered fake news classifiers to obtain the final prediction probabilities.
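A minimal sketch of the classification and mutual-distillation objective (Eqs. (8), (9), (11) and (12)) is given below. The function names are ours, two-class cross-entropy stands in for the written-out binary form of Eq. (9), and in practice each KL term is typically used to update only the network it regularizes.

```python
import torch
import torch.nn.functional as F

def mmcan_loss(logits_t, logits_v, labels, lambda_kl=0.01):
    """Eq. (12): classification losses for both branches plus mutual KL distillation."""
    # Eq. (9): cross-entropy classification loss for each co-attention network.
    loss_ct = F.cross_entropy(logits_t, labels)
    loss_cv = F.cross_entropy(logits_v, labels)
    # Predicted probability distributions of the two classifiers (Eq. (8)).
    p_t = F.softmax(logits_t, dim=-1)
    p_v = F.softmax(logits_v, dim=-1)
    # Eq. (11): L_KL^{T->V} = D_KL(P^V || P^T) pushes the vision-centered
    # classifier towards the text-centered one, and vice versa for L_KL^{V->T}.
    kl_t_to_v = F.kl_div((p_t + 1e-8).log(), p_v, reduction="batchmean")
    kl_v_to_t = F.kl_div((p_v + 1e-8).log(), p_t, reduction="batchmean")
    return loss_ct + loss_cv + lambda_kl * (kl_t_to_v + kl_v_to_t)

def mmcan_predict(logits_t, logits_v):
    """Inference: average the two classifiers' probabilities, then take argmax."""
    p = (F.softmax(logits_t, dim=-1) + F.softmax(logits_v, dim=-1)) / 2
    return p.argmax(dim=-1)
```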
IV. EXPERIMENTS

In this section, we evaluate the effectiveness of our MMCAN.

A. Experimental Settings

TABLE II: Performance comparison among different models on three datasets. The best results are in bold. We mark the results of MMCAN with ∗ if they exceed the results of the strong baseline MFAN with statistical significance (p<0.05). † denotes our reproduced results based on the officially released codes and the rest of the baseline results are directly taken from the papers.
Model | Accuracy | Fake News (Precision, Recall, F1) | Real News (Precision, Recall, F1)
Weibo:
GRU 0.702 0.671 0.794 0.727 0.747 0.609 0.671
ViLT 0.832 0.831 0.837 0.834 0.834 0.827 0.830
MVAE 0.824 0.854 0.769 0.809 0.802 0.879 0.870
SpotFake+ 0.870 0.887 0.849 0.868 0.855 0.892 0.873
SAFE 0.763 0.833 0.659 0.736 0.717 0.868 0.785
CAFE 0.840 0.855 0.830 0.842 0.825 0.851 0.837
HMCAN 0.885 0.920 0.845 0.881 0.856 0.926 0.890
MCAN 0.899 0.913 0.889 0.901 0.884 0.909 0.897
MFAN† 0.891 0.942 0.835 0.885 0.850 0.948 0.896
MMCAN-Res 0.906 0.916 0.897 0.906 0.898 0.916 0.907
MMCAN 0.911∗ 0.913 0.910 0.912∗ 0.909 0.912 0.911∗
Twitter:
GRU 0.634 0.581 0.812 0.677 0.758 0.502 0.604
ViLT 0.759 0.767 0.393 0.520 0.757 0.941 0.839
MVAE 0.745 0.801 0.719 0.758 0.689 0.777 0.730
SpotFake+ 0.790 0.793 0.827 0.810 0.786 0.747 0.766
SAFE 0.766 0.777 0.794 0.786 0.752 0.731 0.742
CAFE 0.806 0.807 0.799 0.803 0.805 0.813 0.809
HMCAN 0.897 0.971 0.801 0.878 0.853 0.979 0.912
MCAN 0.809 0.889 0.765 0.822 0.732 0.871 0.795
MFAN† 0.925 0.835 0.965 0.896 0.981 0.906 0.942
MMCAN-Res 0.933 0.861 0.950 0.903 0.974 0.924 0.948
MMCAN 0.943∗ 0.869 0.976 0.919∗ 0.987 0.927 0.956∗
Pheme:
GRU 0.832 0.782 0.712 0.745 0.855 0.896 0.865
ViLT 0.821 0.659 0.815 0.729 0.914 0.824 0.867
MVAE 0.852 0.806 0.719 0.760 0.871 0.917 0.893
SpotFake+ 0.800 0.730 0.668 0.697 0.832 0.869 0.850
SAFE 0.811 0.827 0.559 0.667 0.806 0.940 0.866
CAFE† 0.861 0.812 0.645 0.719 0.875 0.943 0.908
HMCAN 0.881 0.830 0.838 0.834 0.910 0.905 0.907
MCAN† 0.865 0.790 0.680 0.731 0.887 0.933 0.910
MFAN† 0.888 0.771 0.846 0.807 0.939 0.905 0.922
MMCAN-Res 0.890 0.803 0.794 0.799 0.922 0.926 0.924
MMCAN 0.903∗ 0.855 0.777 0.814∗ 0.918 0.950 0.934∗
social media. Each tweet in the dataset involves the text, image and social context information. The Pheme dataset [39] is collected based on 5 breaking news events, and each event contains a set of claims.

Table I shows the detailed statistics of the three benchmark datasets. In addition, the Weibo dataset contains 9,528 unique images, the Twitter dataset contains 514 unique images, and the Pheme dataset contains 3,670 unique images.

We adopt the same data preprocessing and training/testing split as [4], [24]. If the data split of certain baselines is inconsistent with ours or the experimental results on certain datasets are not given in the original papers, we reproduce the corresponding experimental results according to their released codes.

Implementation Details. We use the pre-trained BERT [32] to get the word embeddings with 768 dimensions. For the input image, we resize it to 224 × 224 and employ ViT-B/16 [36] pre-trained on ImageNet to get the patch embeddings with 768 dimensions. The number of attention heads h in MMCAN is 8. We use the AdamW [40] optimizer with an initial learning rate of 0.001 and weight decay of 0.01 for model optimization. The model is trained for 80 epochs with early stopping to prevent overfitting. We empirically set the batch size as 64 and the trade-off hyper-parameter λKL = 0.01. To mitigate overfitting, we apply dropout with a rate of 0.4. It should be noticed that in our model, the original parameters of ViLT are frozen, and the other parameters are trainable and initialized randomly. For news pieces containing multiple images (more than one image), we follow [11] and randomly select one image. Note that ViLT takes English text as input; thus we pre-translate the text in the Weibo dataset into English via the googletrans library (https://github.com/ssut/py-googletrans) when applying ViLT.
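The stated hyper-parameters translate into roughly the following training loop; `model`, `train_loader`, `val_loader`, `evaluate`, and `mmcan_loss` (see the earlier objective sketch) are placeholders for the actual pipeline, and the early-stopping patience value is our assumption.

```python
from torch.optim import AdamW

# Sketch of the optimization setup: AdamW, lr 1e-3, weight decay 0.01,
# batch size 64 (set in the DataLoader), up to 80 epochs with early stopping.
optimizer = AdamW((p for p in model.parameters() if p.requires_grad),  # ViLT frozen
                  lr=1e-3, weight_decay=0.01)

best_val, patience, bad_epochs = 0.0, 5, 0          # patience value is an assumption
for epoch in range(80):
    model.train()
    for batch in train_loader:
        logits_t, logits_v = model(batch["text"], batch["image"])
        loss = mmcan_loss(logits_t, logits_v, batch["label"], lambda_kl=0.01)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    val_acc = evaluate(model, val_loader)           # assumed validation helper
    if val_acc > best_val:
        best_val, bad_epochs = val_acc, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                  # early stopping
            break
```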
Evaluation Metrics. We employ Accuracy as the evaluation metric for the fake news detection task. Considering the effect of label distribution imbalance, we also report the Precision, Recall, and F1 score of all the models for both fake news and real news following previous works [4], [6], [24].
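For reference, a small sketch of how these metrics can be computed per class with scikit-learn; the convention that label 1 denotes fake news is our assumption.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def report_metrics(y_true, y_pred):
    """Accuracy plus per-class precision/recall/F1, matching the columns of Table II."""
    acc = accuracy_score(y_true, y_pred)
    # labels=[1, 0]: metrics for the fake-news class first, then the real-news class.
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[1, 0])
    return {"accuracy": acc,
            "fake": {"precision": p[0], "recall": r[0], "f1": f1[0]},
            "real": {"precision": p[1], "recall": r[1], "f1": f1[1]}}
```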
TABLE III: Performance comparison among different variants of MMCAN on Weibo, Twitter and Pheme Datasets. The best results are in
bold.
Model | Accuracy | Fake News (Precision, Recall, F1) | Real News (Precision, Recall, F1)
Weibo:
MMCAN-w/o-Match 0.901 0.907 0.895 0.901 0.895 0.907 0.901
MMCAN-w/-T 0.900 0.910 0.891 0.900 0.891 0.910 0.901
MMCAN-w/-V 0.897 0.897 0.899 0.898 0.897 0.895 0.896
MMCAN-Concat 0.908 0.907 0.911 0.909 0.909 0.906 0.907
MMCAN-Avg 0.902 0.934 0.866 0.899 0.874 0.938 0.905
MMCAN 0.911 0.913 0.910 0.912 0.909 0.912 0.911
Twitter:
MMCAN-w/o-Match 0.921 0.838 0.946 0.888 0.971 0.909 0.939
MMCAN-w/-T 0.913 0.813 0.958 0.880 0.977 0.891 0.932
MMCAN-w/-V 0.919 0.841 0.929 0.883 0.963 0.913 0.937
MMCAN-Concat 0.931 0.838 0.980 0.903 0.989 0.906 0.946
MMCAN-Avg 0.929 0.852 0.952 0.899 0.975 0.918 0.945
MMCAN 0.943 0.869 0.976 0.919 0.987 0.927 0.956
Pheme:
MMCAN-w/o-Match 0.893 0.859 0.731 0.790 0.903 0.954 0.928
MMCAN-w/-T 0.884 0.792 0.783 0.787 0.918 0.922 0.920
MMCAN-w/-V 0.881 0.846 0.691 0.761 0.890 0.952 0.920
MMCAN-Concat 0.896 0.843 0.766 0.802 0.914 0.946 0.930
MMCAN-Avg 0.888 0.851 0.720 0.780 0.900 0.952 0.925
MMCAN 0.903 0.855 0.777 0.814 0.918 0.950 0.934
B. Baselines

We compare our MMCAN model with both unimodal and multimodal content based models as follows.

Unimodal Models. We compare MMCAN with GRU [17], which exploits a multilayer GRU network to encode the textual information for fake news detection.

Multimodal Models. We compare our MMCAN with the following multimodal content based methods:
1) ViLT [12], which is a multimodal pre-trained model. As a baseline method for fake news detection, we add a classification head on ViLT and fine-tune the head.
2) MVAE [4], which employs a variational autoencoder coupled with a binary classifier for fake news detection.
3) SpotFake+ [3], which uses the pre-trained XLNet [41] and VGG-19 to learn textual and visual features, and concatenates them for fake news detection.
4) SAFE [9], which jointly exploits the multimodal features and cross-modal similarity of news content for fake news detection.
5) CAFE [11], which learns the cross-modal ambiguity and uses it to adaptively aggregate multimodal features and unimodal features for fake news detection.
6) HMCAN [24], which employs a hierarchical multimodal contextual attention network for fake news detection by jointly modeling the multimodal context information and the hierarchical semantics of text.
7) MCAN [6], which employs multiple co-attention layers to fuse the multimodal features for fake news detection.
8) MFAN [8], which integrates textual, visual, and social network features through the co-attention mechanism for fake news detection.
9) MMCAN-Res, our MMCAN with the ViT encoder replaced by the same ResNet-50 [42] as HMCAN.

C. Results and Analysis

We run MMCAN with 5 random seeds and report the average performance in Table II. From the table, we can obtain the following observations: 1) The proposed MMCAN outperforms all the baseline models on all the datasets in terms of accuracy and most other metrics. Compared to the best baseline model, MMCAN improves the accuracy by about 1.2%, 1.8% and 1.5% on the Weibo, Twitter and Pheme datasets, respectively. 2) Models that consider multimodal information outperform unimodal models, which confirms the advantage of integrating multiple modality information in fake news detection. 3) Compared with the models which concatenate multimodal features or leverage auxiliary tasks (i.e., SpotFake+ and MVAE) and the multimodal pre-trained model ViLT, models based on the co-attention mechanism (i.e., HMCAN and MCAN) perform better, indicating that the co-attention mechanism can better fuse multimodal information. 4) Our MMCAN further outperforms the co-attention based methods (i.e., HMCAN, MCAN and MFAN). Compared with methods that consider the consistency of textual and visual content (i.e., SAFE and CAFE), our MMCAN achieves substantial improvements. We believe that our proposed multimodal matching-aware co-attention networks with mutual knowledge distillation can learn better multimodal fusion features guided by the image-text alignment and enable the two co-attention networks respectively centered on text and image to learn from each other for collaboratively improving fake news detection.
5) Our model variant MMCAN-Res, which replaces the ViT encoder with ResNet, still outperforms all the other baseline models in terms of accuracy and most other metrics. This demonstrates that our model indeed benefits from the ITM-aware co-attention networks with mutual learning.

D. Ablation Study

To verify the importance of each module in our MMCAN, we compare MMCAN with the following variants:
1) MMCAN-w/o-Match, a variant of MMCAN that replaces the ITM-aware co-attention mechanism with the traditional co-attention mechanism.
2) MMCAN-w/-T, a variant of MMCAN based on only the co-attention network centered on text.
3) MMCAN-w/-V, a variant of MMCAN based on only the co-attention network centered on image.
4) MMCAN-Concat, a variant of MMCAN that concatenates the output features of MMCAN-w/-T and MMCAN-w/-V for classification.
5) MMCAN-Avg, a variant of MMCAN that ensembles MMCAN-w/-T and MMCAN-w/-V by averaging their prediction probabilities instead of using the mutual learning mechanism.

Table III shows the results of the ablation studies. We can obtain the following observations: 1) MMCAN consistently outperforms MMCAN-w/o-Match on the three datasets, which proves the necessity of considering the image-text alignment in the co-attention mechanism for learning multimodal fusion features. 2) Both MMCAN-w/-T and MMCAN-w/-V achieve lower performance, demonstrating that an individual text- or vision-centered co-attention network is suboptimal for fake news detection. On the Weibo dataset, MMCAN-w/-T performs much better than MMCAN-w/-V. The reason could be that the text on Weibo is relatively long, containing more information for fake news detection [6]. 3) MMCAN-Concat, which concatenates the text- and vision-centered multimodal features for classification, outperforms MMCAN-w/-T and MMCAN-w/-V with only text- or vision-centered multimodal features. Nevertheless, MMCAN-Concat achieves worse performance than MMCAN, demonstrating that the mutual learning enabling knowledge distillation between the text- and vision-centered co-attention networks improves fake news detection.
[Fig. 5, Example 2 news text: "The Milky Way and the solar eclipse seen from the ISS."]
4) Compared to the ensemble method MMCAN-Avg, which averages the prediction probabilities of the two co-attention networks that are respectively centered on image and text, MMCAN enabling mutual knowledge distillation between them achieves significant improvements.

E. Impact of the Value of λKL

To explore the impact of the balance factor λKL on the model performance, we vary λKL from 5e-5 to 0.5, and report the accuracy and average F1 score (the average of the F1 scores for fake news and real news) on the three datasets in Figure 4. We observe that the accuracy and average F1 score of MMCAN on the three datasets generally first grow as λKL increases, and then begin to drop after λKL becomes larger than a certain value. MMCAN achieves the highest value at λKL = 0.005 on the Twitter dataset, and it performs best when λKL = 0.01 on both the Weibo and Pheme datasets. The best values of λKL are relatively small. We think the reason could be that λKL works like a regularization coefficient that penalizes the inconsistency of the predictions of the two co-attention networks.

F. Case Study

To gain an intuitive understanding of the ITM-aware co-attention mechanism, we visualize the word attention weights over the image patches calculated by Equation (5) in the text-centered co-attention network. For ease of illustration, we reflect the attention weights in the opacity of the patches. If the attention value is larger than the median attention weight, the opacity is set as 255; otherwise, it is set as 76. We visualize the results of MMCAN-w/o-Match and MMCAN in Figure 5. As can be seen from Example 1 in Figure 5, MMCAN focuses on the facial area of the child as the corresponding object, which is consistent with the word 'Refugee', while the attention is scattered all over the image in MMCAN-w/o-Match. In Example 2, we find that our MMCAN with the ITM-aware co-attention mechanism better aligns the word 'eclipse' with the luminous, circular object in the image. These cases demonstrate that considering image-text alignment in the co-attention mechanism can better capture the inter-modality correlations.
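A tiny sketch of the visualization rule described above (function name ours):

```python
import numpy as np

def patch_opacity(attn_weights: np.ndarray) -> np.ndarray:
    """Map one word's attention over image patches (from Eq. (5)) to opacities:
    patches above the median weight become fully opaque (255), the rest 76."""
    threshold = np.median(attn_weights)
    return np.where(attn_weights > threshold, 255, 76)
```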
V. CONCLUSION

In this paper, we present novel multimodal matching-aware co-attention networks with mutual knowledge distillation for fake news detection. The ITM-aware co-attention mechanism in MMCAN can learn better multimodal fusion features guided by the image-text alignment. In addition, MMCAN enables mutual knowledge distillation between co-attention networks respectively focused on text and image for collaboratively improving fake news detection. Extensive experiments on three public benchmark datasets demonstrate the effectiveness of MMCAN. In the future, we plan to utilize external knowledge for improving fake news detection and explore mutual learning between different views, including the knowledge-guided view.

REFERENCES

[1] Z. Jin, J. Cao, H. Guo, Y. Zhang, and J. Luo, "Multimodal fusion with recurrent neural networks for rumor detection on microblogs," in Proceedings of the ACM on Multimedia Conference, 2017, pp. 795–816.
[2] S. Singhal, R. R. Shah, T. Chakraborty, P. Kumaraguru, and S. Satoh, "Spotfake: A multi-modal framework for fake news detection," in Fifth International Conference on Multimedia Big Data, 2019, pp. 39–47.
[3] S. Singhal, A. Kabra, M. Sharma, R. R. Shah, T. Chakraborty, and P. Kumaraguru, "Spotfake+: A multimodal framework for fake news detection via transfer learning (student abstract)," in The Conference on Artificial Intelligence, 2020, pp. 13915–13916.
[4] D. Khattar, J. S. Goud, M. Gupta, and V. Varma, "MVAE: multimodal variational autoencoder for fake news detection," in The World Wide Web Conference, 2019, pp. 2915–2921.
[5] Y. Wang, F. Ma, Z. Jin, Y. Yuan, G. Xun, K. Jha, L. Su, and J. Gao, "EANN: event adversarial neural networks for multi-modal fake news detection," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 849–857.
[6] Y. Wu, P. Zhan, Y. Zhang, L. Wang, and Z. Xu, "Multimodal fusion with co-attention networks for fake news detection," in Findings of the Association for Computational Linguistics, 2021, pp. 2560–2569.
[7] P. Qi, J. Cao, X. Li, H. Liu, Q. Sheng, X. Mi, Q. He, Y. Lv, C. Guo, and Y. Yu, "Improving fake news detection by using an entity-enhanced framework to fuse diverse multimodal clues," in ACM Multimedia Conference, 2021, pp. 1212–1220.
[8] J. Zheng et al., "MFAN: multi-modal feature-enhanced attention networks for rumor detection," in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022, pp. 2413–2419.
[9] X. Zhou et al., "SAFE: similarity-aware multi-modal fake news detection," in Advances in Knowledge Discovery and Data Mining, H. W. Lauw, R. C. Wong, A. Ntoulas, E. Lim, S. Ng, and S. J. Pan, Eds., 2020, pp. 354–367.
[10] J. Xue et al., "Detecting fake news by exploring the consistency of multimodal data," Inf. Process. Manag., p. 102610, 2021.
[11] Y. Chen et al., "Cross-modal ambiguity learning for multimodal fake news detection," in The ACM Web Conference, 2022, pp. 2897–2905.
[12] W. Kim, B. Son, and I. Kim, "ViLT: Vision-and-language transformer without convolution or region supervision," in Proceedings of the 38th International Conference on Machine Learning, 2021, pp. 5583–5594.
[13] N. Ruchansky, S. Seo, and Y. Liu, "CSI: A hybrid deep model for fake news detection," in Proceedings of the ACM on Conference on Information and Knowledge Management, 2017, pp. 797–806.
[14] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, "Fake news detection on social media: A data mining perspective," ACM SIGKDD Explorations Newsletter, pp. 22–36, 2017.
[15] C. Castillo, M. Mendoza, and B. Poblete, "Information credibility on twitter," in Proceedings of the 20th International Conference on World Wide Web, 2011, pp. 675–684.
[16] Y. Chen, N. J. Conroy, and V. L. Rubin, "Misleading online content: Recognizing clickbait as "false news"," in Proceedings of the ACM Workshop on Multimodal Deception Detection, 2015, pp. 15–19.
[17] J. Ma, W. Gao, P. Mitra, S. Kwon, B. J. Jansen, K. Wong, and M. Cha, "Detecting rumors from microblogs with recurrent neural networks," in Proceedings of the 25th International Joint Conference on Artificial Intelligence, 2016, pp. 3818–3824.
[18] L. Hu, T. Yang, L. Zhang, W. Zhong, D. Tang, C. Shi, N. Duan, and M. Zhou, "Compare to the knowledge: Graph neural fake news detection with external knowledge," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021, pp. 754–763.
[19] Y. Dun, K. Tu, C. Chen, C. Hou, and X. Yuan, "KAN: knowledge-aware attention network for fake news detection," in 35th Conference on Artificial Intelligence, 2021, pp. 81–89.
[20] L. Wu et al., "Category-controlled encoder-decoder for fake news detection," IEEE Transactions on Knowledge and Data Engineering, pp. 1–1, 2021.
[21] Q. Liao et al., "An integrated multi-task model for fake news detection," IEEE Transactions on Knowledge and Data Engineering, pp. 1–1, 2021.
[22] Z. Jin, J. Cao, Y. Zhang, J. Zhou, and Q. Tian, "Novel visual and statistical image features for microblogs news verification," IEEE Trans. Multim., pp. 598–608, 2017.
[23] P. Qi, J. Cao, T. Yang, J. Guo, and J. Li, "Exploiting multi-domain visual information for fake news detection," in International Conference on Data Mining, 2019, pp. 518–527.
[24] S. Qian, J. Wang, J. Hu, Q. Fang, and C. Xu, "Hierarchical multi-modal contextual attention network for fake news detection," in The International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, 2021, pp. 153–162.
[25] P. Li et al., "Entity-oriented multi-modal alignment and fusion network for fake news detection," IEEE Trans. Multim., pp. 3455–3468, 2022.
[26] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, "Deep mutual learning," in Conference on Computer Vision and Pattern Recognition, 2018, pp. 4320–4328.
[27] J. Wang, J. Li, Y. Shi, J. Lai, and X. Tan, "AM3Net: Adaptive mutual-learning-based multimodal data fusion network," IEEE Trans. Circuits Syst. Video Technol., pp. 5411–5426, 2022.
[28] Z. Wei et al., "Cross-modal knowledge distillation in multi-modal fake news detection," in International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 4733–4737.
[29] Y. Zhao and J. Kong, "Mutual learning and feature fusion siamese networks for visual object tracking," IEEE Trans. Circuits Syst. Video Technol., pp. 3154–3167, 2021.
[30] D. Zhang, Z. Zhang, Y. Ju, C. Wang, Y. Xie, and Y. Qu, "Dual mutual learning for cross-modality person re-identification," IEEE Trans. Circuits Syst. Video Technol., pp. 5361–5373, 2022.
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems 30, 2017, pp. 5998–6008.
[32] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186.
[33] Y. Wang, Y. Qiu, P. Cheng, and J. Zhang, "Hybrid cnn-transformer features for visual place recognition," IEEE Trans. Circuits Syst. Video Technol., pp. 1109–1122, 2023.
[34] M. Cao, Y. Fan, Y. Zhang, J. Wang, and Y. Yang, "VDTR: video deblurring with transformer," IEEE Trans. Circuits Syst. Video Technol., pp. 160–171, 2023.
[35] M. Dai, J. Hu, J. Zhuang, and E. Zheng, "A transformer-based feature segmentation and region alignment method for uav-view geo-localization," IEEE Trans. Circuits Syst. Video Technol., pp. 4376–4389, 2022.
[36] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," in 9th International Conference on Learning Representations, 2021.
[37] H. Wen, X. Song, X. Yang, Y. Zhan, and L. Nie, "Comprehensive linguistic-visual composition network for image retrieval," in The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, 2021, pp. 1369–1378.
[38] C. Boididou, S. Papadopoulos, D. Dang-Nguyen, G. Boato, M. Riegler, S. E. Middleton, A. Petlund, and Y. Kompatsiaris, "Verifying multimedia use at mediaeval 2016," in Working Notes Proceedings of the MediaEval Workshop, vol. 1739, 2016.
[39] A. Zubiaga, M. Liakata, and R. Procter, "Exploiting context for rumour detection in social media," in Social Informatics - 9th International Conference, 2017, pp. 109–123.
[40] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in 7th International Conference on Learning Representations, 2019.
[41] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le, "Xlnet: Generalized autoregressive pretraining for language understanding," in Advances in Neural Information Processing Systems 32, 2019, pp. 5754–5764.
[42] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.