Improve image captioning via relation modeling

F Huang, Z Li - ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, 2022 - ieeexplore.ieee.org
The performance of image captioning has improved significantly in recent years through deep neural network architectures combined with attention mechanisms and reinforcement learning optimization. However, the visual relationships and interactions between objects appearing in an image remain largely unexplored. In this paper, we present a novel approach that combines scene graphs with the Transformer, which we call SGT, to explicitly encode the visual relationships between detected objects. Specifically, we pretrain a scene graph generation model to predict graph representations for images. Then, for each graph node, a Graph Convolutional Network (GCN) aggregates the information of the node's local neighbors to acquire relationship knowledge. When training the captioning model, we feed this relation-aware information into the Transformer to generate descriptive sentences. Experiments on the MS COCO dataset validate the superiority of our SGT model, which achieves state-of-the-art results on all the standard evaluation metrics.
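The pipeline the abstract describes (scene-graph nodes, GCN neighbor aggregation, relation-aware features fed to a Transformer decoder) can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' implementation: the mean-aggregation GCN layer, the layer sizes, and all class and variable names below are assumptions chosen for the sketch.

```python
# Minimal sketch of an SGT-style pipeline: GCN over scene-graph nodes,
# relation-aware node features used as memory for a Transformer decoder.
# All design choices here are illustrative, not from the paper.
import torch
import torch.nn as nn


class SimpleGCNLayer(nn.Module):
    """One GCN layer: each node averages its neighbors' features."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, node_feats, adj):
        # adj: (N, N) adjacency (with self-loops) that would come from a
        # pretrained scene-graph generation model in the full method.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        agg = (adj / deg) @ node_feats  # mean over each node's neighborhood
        return torch.relu(self.linear(agg))


class SketchCaptioner(nn.Module):
    """Relation-aware node features serve as decoder memory for captioning."""

    def __init__(self, dim=64, vocab=100):
        super().__init__()
        self.gcn = SimpleGCNLayer(dim)
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(dim, vocab)

    def forward(self, node_feats, adj, tokens):
        mem = self.gcn(node_feats, adj).unsqueeze(0)  # (1, N, dim)
        tgt = self.embed(tokens)                      # (1, T, dim)
        return self.out(self.decoder(tgt, mem))       # (1, T, vocab)


# Toy forward pass: 5 detected objects, a 7-token partial caption.
nodes = torch.randn(5, 64)
adj = (torch.rand(5, 5) > 0.5).float() + torch.eye(5)
tokens = torch.randint(0, 100, (1, 7))
logits = SketchCaptioner()(nodes, adj, tokens)
print(logits.shape)  # torch.Size([1, 7, 100])
```

In the full method, `node_feats` and `adj` would come from the pretrained scene-graph model rather than random tensors, and the decoder would be trained with the paper's reinforcement learning optimization.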