Image captioning is a cross-modal task combining computer vision [4, 18, 24, 55] and natural language processing [48, 53]. Many methods [26, 44, 50, 54, 57] have been proposed since the emergence of deep learning in recent years. Vinyals et al. [50] proposed to use a deep Convolutional Neural Network (CNN) encoder to extract the visual features of the image and utilized an RNN as the language decoder to generate caption sequences. Subsequently, the spatial attention mechanism [57] was further investigated for its application in image captioning tasks.
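To make the encoder-decoder paradigm above concrete, the following is a minimal sketch of a CNN encoder paired with an RNN decoder that applies soft spatial attention over the feature map at each decoding step. It is illustrative only: the ResNet-50 backbone, GRU cell, additive attention, and all dimensions are assumptions rather than the exact configurations of [50] or [57].

```python
# Sketch of the CNN-encoder / attentive RNN-decoder captioning paradigm.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class CNNEncoder(nn.Module):
    """Extract a grid of spatial features from an image with a pretrained CNN."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # drop pool/fc, keep feature map

    def forward(self, images):                      # (B, 3, H, W)
        fmap = self.cnn(images)                     # (B, 2048, h, w)
        return fmap.flatten(2).transpose(1, 2)      # (B, h*w, 2048) grid of visual features


class AttentiveRNNDecoder(nn.Module):
    """GRU decoder with additive (soft) spatial attention over encoder features."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_v = nn.Linear(feat_dim, hidden_dim)
        self.att_h = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        self.gru = nn.GRUCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):             # feats: (B, N, D), captions: (B, T)
        B, T = captions.shape
        h = feats.new_zeros(B, self.gru.hidden_size)
        logits = []
        for t in range(T):
            # additive attention: score each spatial location against the hidden state
            scores = self.att_out(torch.tanh(self.att_v(feats) + self.att_h(h).unsqueeze(1)))
            alpha = scores.softmax(dim=1)            # (B, N, 1) attention weights
            context = (alpha * feats).sum(dim=1)     # (B, D) attended visual feature
            h = self.gru(torch.cat([self.embed(captions[:, t]), context], dim=-1), h)
            logits.append(self.fc(h))
        return torch.stack(logits, dim=1)            # (B, T, vocab_size) word scores
```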
More recently, the transformer network [48] with a Multi-Head Attention (MHA) mechanism has been introduced to benefit image caption generation. Huang et al. [26] proposed the attention-on-attention architecture, which utilizes a gated attention mechanism to measure the relevance between the attention results and the queries. Cornia et al. [11] developed a meshed-memory transformer to exploit both low-level and high-level contributions of the visual features for caption generation. Yan et al. [58] integrated a task-adaptive attention module into the transformer-based model, enabling it to identify task-specific clues and reduce misleading information from improper key-value pairs. Li et al. [33] developed a long short-term graph, which effectively captures short-term spatial relationships and long-term transformation dependencies of visual features for image captioning. Guo et al. [22] introduced a normalized attention module, allowing the transformer network to incorporate the geometric structure of the input objects through geometry-aware self-attention. Pan et al. [44] developed a unified X-linear attention block for image captioning. Luo et al. [41] proposed a dual-level collaborative transformer for image captioning to realize the complementary advantages of region and spatial image features.
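As an illustration of the gated attention idea attributed to [26], the sketch below wraps a standard multi-head attention layer and gates its output with the query, so that weakly related attention results can be suppressed. The use of nn.MultiheadAttention as the base attention, the two linear projections, and all dimensions are assumptions for exposition, not the authors' implementation.

```python
# Sketch of attention-on-attention style gating over a standard attention result.
import torch
import torch.nn as nn


class AoABlock(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.info = nn.Linear(2 * dim, dim)   # "information" vector from [query; result]
        self.gate = nn.Linear(2 * dim, dim)   # sigmoid gate over that information

    def forward(self, query, key_value):
        # standard multi-head attention result
        att, _ = self.mha(query, key_value, key_value)        # (B, Lq, dim)
        qv = torch.cat([query, att], dim=-1)                   # (B, Lq, 2*dim)
        # gate the information vector: reflects how relevant the result is to the query
        return torch.sigmoid(self.gate(qv)) * self.info(qv)    # (B, Lq, dim)


# usage: refine attended visual features with their queries
aoa = AoABlock(dim=512)
q = torch.randn(2, 10, 512)      # decoder queries
v = torch.randn(2, 36, 512)      # region features as keys/values
out = aoa(q, v)                  # (2, 10, 512)
```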
Moreover, many existing methods [27, 61] incorporated semantic visual features, allowing the captioning model to generate the most relevant attribute words. Jiang et al. [27] built a guiding network to learn an extra guiding vector for caption sentence generation. Deng et al. [13] proposed a syntax-guided hierarchical attention network that incorporates visual features with semantic and syntactic information to improve the interpretability of the captioning model. Li et al. [32] devised a transformer-based framework that exploits visual and semantic information simultaneously. Yao et al. [60] proposed to utilize a Graph Convolutional Network (GCN) to integrate semantic and spatial relationships between objects. Yang et al. [21] proposed to integrate semantic priors into the model by exploiting a graph-based representation of both images and sentences. Dong et al. [15] introduced a dual-GCN to model the object relationships within a single image as well as the relationships among similar images in the dataset for image captioning.
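For the relationship-modeling line of work (e.g., [60, 15]), a single graph-convolution layer over detected object features conveys the core operation: each region feature is updated by aggregating the features of the objects it is related to, according to a semantic or spatial relationship graph. The self-loops, row normalization, and single-layer design below are simplifying assumptions, not the cited models.

```python
# Sketch of one graph-convolution step over an object relationship graph.
import torch
import torch.nn as nn


class ObjectGCNLayer(nn.Module):
    def __init__(self, dim=2048):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, obj_feats, adj):
        # obj_feats: (B, N, dim) region features; adj: (B, N, N) relationship graph
        adj = adj + torch.eye(adj.size(-1), device=adj.device)   # add self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)       # node degrees
        norm_adj = adj / deg                                      # row-normalize
        # each object aggregates the features of the objects it is related to
        return torch.relu(self.proj(norm_adj @ obj_feats))        # (B, N, dim)


# usage: refine 36 detected regions with a semantic/spatial relationship graph
feats = torch.randn(2, 36, 2048)
adj = (torch.rand(2, 36, 36) > 0.8).float()   # e.g., predicted pairwise relations
refined = ObjectGCNLayer()(feats, adj)        # relationship-aware region features
```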